Abstract
This chapter describes a general approach for image classification using Genetic Programming (GP) and demonstrates this approach through the application of GP to the task of stage 1 cancer detection in digital mammograms. We detail an automated work-flow that begins with image processing and culminates in the evolution of classification models which identify suspicious segments of mammograms. Early detection of breast cancer is directly correlated with survival of the disease, and mammography has been shown to be an effective tool for early detection, which is why many countries have introduced national screening programs. However, this presents challenges, as such programs involve screening a large number of women and thus require more trained radiologists at a time when there is a shortage of these professionals in many countries. Also, as mammograms are difficult to read and radiologists typically only have a few minutes allocated to each image, screening programs tend to be conservative, involving many callbacks which increase both the workload of the radiologists and the stress and worry of patients. Fortunately, the relatively recent increase in the availability of mammograms in digital form means that it is now much more feasible to develop automated systems for analysing mammograms. Such systems, if successful, could provide a very valuable second-reader function. We present a work-flow that begins by processing digital mammograms to segment them into smaller sub-images and to extract features which describe textural aspects of the breast. The most salient of these features are then used in a GP system which generates classifiers capable of identifying which particular segments may have suspicious areas requiring further investigation. An important objective of this work is to evolve classifiers which detect as many cancers as possible but which are not overly conservative.
The classifiers give results of 100% sensitivity and a false positive per image rating of just 0.33, which is better than prior work. Moreover, our system can use GP as part of a feedback loop, to both select and help generate further features.
Chapter 10
Image Classification with Genetic Programming: Building a Stage 1 Computer Aided Detector for Breast Cancer
Conor Ryan, Jeannie Fitzgerald, Krzysztof Krawiec, and David Medernach
10.1 Introduction
Image Classification (IC) is concerned with automatically classifying images based
on their features, which are typically some sort of measurable/quantifiable property,
such as brightness, interest points, etc. The term “feature” can have several
meanings in Pattern Recognition (PR) and Machine Learning (ML) where it may
be defined as either a location in an image that is relevant with respect to some
classification/detection/analysis task or simply as a scalar value extracted from an
image. For this work we adopt the latter meaning.
IC has been applied in fields as diverse as medicine (Petrick et al. 2013),
military (Howard et al. 2006), security (Xie and Shang 2014), astronomy (Riess
et al. 1998) and food science (Tan et al. 2000). Part of the success of IC stems from
the fact that the same key steps are applied regardless of the application domain.
This chapter describes IC in detail using mammography as a test problem.
Statistics produced by the Organisation for Economic Co-operation and Devel-
opment (OECD) highlight the importance of the early detection of breast cancer,
both in terms of extending the longevity of women and in reducing financial costs.
Routine mammographic screening, particularly at a national level, is by far the most
Electronic supplementary material is available in the online version of this chapter (doi: 10.1007/978-3-319-20883-1_10).
C. Ryan (✉) • J. Fitzgerald • D. Medernach
University of Limerick, Limerick, Ireland
K. Krawiec
University of Poznań, Poznań, Poland
A.H. Gandomi et al. (eds.), Handbook of Genetic Programming Applications,
DOI 10.1007/978-3-319-20883-1_10
effective tool for the early detection and subsequent successful treatment of breast
cancer (Tot et al. 2000; Tabar et al. 2000; Smith et al. 2012). It is essential to discover
signs of cancer early, as survival is directly correlated with early detection (Tabar
et al. 2000).
Screening is usually performed on asymptomatic women over a certain age
(e.g. over 50 in many European countries) at regular periods, typically every 2 or 3
years. In national mammography screening, radiologists examine the mammograms
of thousands of women (typically having only a few minutes to examine each image)
to determine if there are early signs of a cancerous growth or a lesion that may
require further examination.
The introduction of screening programs has contributed to a higher demand
for radiologists, and a worldwide shortage of qualified radiologists who choose
mammography as their area of specialisation (Bhargavan et al. 2002), particularly
in the USA, has led to many radiologists being dangerously overworked (Berlin
2000). This is likely to lead to (i) there being insufficient time for radiologists to
read and interpret mammograms (mammograms are notoriously difficult to read);
(ii) an inability to provide redundant readings (more than one radiologist checking
each mammogram); and (iii) radiologists being overly conservative, which in turn
is likely to increase the number of patient call backs, thus resulting in unnecessary
biopsies. This can lead to anxiety and mistrust of the system such that patients
become disillusioned with the process and less inclined to participate. This work
aims to improve the early detection of true positives by evolving detectors which,
although accurate, are not overly conservative.
If breast cancer is diagnosed, further tests are usually carried out to determine
the extent of the cancer. The disease is then assigned a “stage” depending on
characteristics such as the size of the tumour, whether the cancer is invasive or non-
invasive, whether lymph nodes are involved, and whether the cancer has spread
to other areas of the body. These stages are numbered 0, 1, 2, 3 and 4, and there
are various sub-stages in between. At stage 0 the cancer is localised and there is
no evidence of cancerous cells outside the original site, while at stage 4 the cancer
has spread to other organs of the body. According to the OECD, 75% of patients
diagnosed with breast cancer at Stage 0 are said to have close to a 100% survival
rate, while at Stage 4 survival rates drop to between 20 and 40%. Treatment cost
is six times higher when a diagnosis is made at Stage 4 than at Stage 0 (Hughes and
Jacobzone 2003). There is a large body of scientific evidence supporting the view
that mammography is currently the strongest tool available in the fight against breast
cancer (Kopans 2003; Tabar et al. 2000).
A stage 1 detector examines mammograms and highlights suspicious areas that
require further investigation. An overly conservative approach degenerates to marking
every mammogram (or segment thereof) as suspicious, while missing a cancerous area
can be disastrous.
Various studies (Anttinen et al. 1993; Ciatto et al. 2005) have shown that
second (redundant) reader functionality has a valuable role to play in breast cancer
detection, offering increases in detection rates of between 4.5 and 15 % together
with the possibility of discovering cancers at an earlier stage (Thurfjell et al. 1994).
However, due to shortages of qualified personnel in several countries and the extra
costs involved, the use of independent second readers in large scale screening
programs is not always possible. Thus, the availability of a reliable automated
stage 1 detector would be a very useful and cost effective resource.
We describe a fully automated work-flow for performing stage 1 breast cancer
detection with GP (Koza 1990) as its cornerstone. Mammograms are by far the most
widely used method for detecting breast cancer in women, and their use in national
screening can have a dramatic impact on early detection and survival rates. With the
increased availability of digital mammography, it is becoming increasingly more
feasible to use automated methods to help with detection.
Our work-flow positions us right at the data collection phase such that we
generate textural features ourselves. These are fed through our system, which
performs feature analysis on them before passing the ones that are determined to be
most salient on to GP for classifier generation. The best of these evolved classifiers
produces results of 100% sensitivity and a false positive per image rating of just
0.33, which is better than prior work. Our system can use GP as part of a feedback
loop, to both select existing features and to help extract further features. We show
that virtually identical work-flows (just with different feature extraction methods)
can be applied to other IC tasks.
The following section provides a background to the work and outlines some of
the important existing research, while Sect. 10.3 demonstrates how our proposed
work-flow moves from raw mammograms to GP classifiers. The specifics of the GP
experiments are detailed in Sect. 10.4 and the results are in Sect. 10.5. We finish with
the conclusions and future work in Sect. 10.6.
10.2 Background
Image analysis with classification is a broad research area and there is a plethora
of GP literature on the topic, from early work such as Koza (1993), Tackett (1993),
Andre (1994) to more recent studies such as Bozorgtabar and Ali Rezai Rad (2011),
Fu et al. (2014), and Langdon et al. (2014). Describing the full breadth of the
research is far beyond the scope of this article. Thus, we direct the interested
reader to Krawiec et al. (2007) for a review of GP for general image analysis,
and we choose to focus here on the most relevant aspects in the current context:
classification, object detection, and feature extraction, detection and selection.
In early work on image analysis, Poli (1996) presented an approach based on the
idea of using GP to evolve effective image filters. They applied their method to the
problem of segmentation of the brain in pairs of Magnetic Resonance images and
reported that their GP system outperformed Neural Networks (NNs) on the same
task.
Agnelli et al. (2002) demonstrated the usefulness of GP in the area of document
image understanding and emphasised the benefits of the understandability of GP
solutions compared with those produced by NNs or statistical approaches. In other
work Zhang et al. (2003) proposed a domain-independent GP method for tackling
object detection problems in which the locations of small objects of multiple classes
in large images must be found. Their system applied a “moving window” and used
pixel statistics to construct a detection map. They reported competitive results with
this approach.
Several novel fitness functions were investigated in Zhang and Lett (2006),
together with a filtering technique applied to training data, for improving object
localisation with GP. They reported fewer false alarms and faster training times for
their system.
The suitability of different search drivers was also studied by Krawiec (2015),
who examined the effectiveness of several different fitness functions applied to the
problem of detection of blood vessels in ophthalmology imaging. Another example
where preprocessing of training samples proved effective can be seen in Ando and
Nagao (2009) where training images were divided into sub-populations based on
predefined image characteristics.
An alternative approach, which leveraged the natural ability of evolutionary
computation to perform feature selection, was suggested by Komosiński and
Krawiec (2000), who developed a novel GA system which weighted selected
features. They reported superior results when their approach was used for detec-
tion of central nervous system neuroepithelial tumours. In related work Krawiec
(2002) investigated the effectiveness of a feature construction approach with GP,
where features were constructed based on a measure of utility determined by
their perceived effectiveness according to decision tree induction. The reported
feature construction approach significantly outperformed standard GP on several
classification benchmarks.
A grammar guided GP approach was used to locate the common carotid artery in
ultrasound images in Benes et al. (2013), and this approach resulted in a significant
improvement on the state of the art for that task.
Details of other object detection research of note may be found in for exam-
ple Robinson and McIlroy (1995), Benson (2000), Howard et al. (2002), Isaka
(1997), Zhang and Lett (2006), and Trujillo and Olague (2006).
A thorough review of feature detection approaches in the general literature can
be found in Tuytelaars and Mikolajczyk (2008). In the field of GP, a wide variety
of different types of features have been used to guide classification. “Standard”
approaches include first, second and higher order statistical features which may be
local (pixel based) (Howard et al. 2006) or global (area based) (Lam and Ciesielski
2004). Wavelets have been employed in various work including Chen and Lu (2007)
and Padole and Athaide (2013). Texture features constructed from pixel grey levels
were used in Song et al. (2002) to discriminate simple texture images. Cartesian
GP was used to implement Transform-based Evolvable Features (TEFs) in Kowaliw
et al. (2009), which were used to evolve image transformations: an approach which
improved classification of Muscular Dystrophy in cell nuclei by 38% over previous
methods (Zhang et al. 2013).
Recently, local binary patterns (LBP) were successfully used with GP for
anomaly detection in crowded scenes (Xie and Shang 2014). LBPs were also
previously used in, for example, Al-Sahaf et al. (2013) and Oliver et al. (2007).
With regard to feature extraction and classification, Atkins et al. (2011) also sug-
gested a domain-independent approach to image feature extraction and classification
where each individual was constructed using a three tier architecture, where each
tier was responsible for a specific function: classification, aggregation, and filtering.
This work was later developed to use a two-tier architecture in Al-Sahaf et al. (2012).
These researchers demonstrated that their automated system performed as well as a
baseline GP-based classifier system that used human-extracted features. A review of
pattern recognition approaches to cancer diagnosis is presented in Abarghouei et al.
(2009), where the researchers reported competitive performance of GP on feature
extraction when compared with other machine learning (ML) algorithms and feature
extraction algorithms.
A multi-objective GP (MOGP) approach to feature extraction and classification
was recently adopted in Shao et al. (2014) which constructed feature descriptors
from low-level pixel primitives and evaluated individual performance based on
classification accuracy and tree complexity. They reported superior performance of
their method when compared with both a selection of hand-crafted approaches to
feature extraction and several automated ML systems.
Of special note are the Hybrid Evolutionary Learning for Pattern Recognition
(HELPR) (Rizki et al. 2002) and CellNet (Kharma et al. 2004) systems, both of
which aspire to being fully autonomous pattern recognisers. HELPR combines
aspects of evolutionary programming, genetic programming, and genetic algo-
rithms (GAs) whereas CellNet employs a co-evolutionary approach using GAs.
10.2.1 Performance Metrics
In classification the true positive rate (TPR) is the proportion of positive instances
which the radiologist or learning system classifies as positive, and the false positive
rate (FPR) is the proportion of instances actually belonging to the negative class that
are misclassified as positive. In the classification task of discriminating cancerous
from non-cancerous instances, the objective is to maximize the TPR while at the
same time minimizing the FPR. The TPR is of primary importance as the cost of
missing a cancerous case is potentially catastrophic for the individual concerned.
However, it is also very important to reduce the FPR as much as possible due to the
various issues associated with false alarms, as outlined in Sect. 10.1.Intheliterature,
when a classification task involves image processing, the number of false positives
per image (FPPI) is usually reported. This is the number of false positives divided
by the number of images.
In classification literature the TPR is often referred to as sensitivity or recall,
whereas specificity is a term used to describe the true negative rate (TNR), and
FPR = 1 − specificity.
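The relationships between these metrics can be summarised in a short sketch. The counts below are invented segment-level numbers, chosen only to mirror the 100% sensitivity and 0.33 FPPI figures quoted elsewhere in the chapter.

```python
def classification_metrics(tp, fp, tn, fn, n_images):
    """Compute TPR, FPR, specificity and FPPI from raw counts.
    Counts are assumed to be over image segments; n_images is the
    number of whole images from which the segments were taken."""
    tpr = tp / (tp + fn)           # sensitivity / recall
    fpr = fp / (fp + tn)           # = 1 - specificity
    specificity = tn / (tn + fp)   # true negative rate
    fppi = fp / n_images           # false positives per image
    return tpr, fpr, specificity, fppi

tpr, fpr, spec, fppi = classification_metrics(tp=30, fp=10, tn=90, fn=0,
                                              n_images=30)
# tpr = 1.0, fpr = 0.1, spec = 0.9, fppi ≈ 0.33
```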
The Receiver Operating Characteristic (ROC) is a tool which originates from
World War II where it was used to evaluate the performance of radio personnel at
accurately reading radar images. These days it is sometimes used to measure the
performance of medical tests, radiologists and classifiers. It can also be used to
examine the balance between the TPR and FPR as the decision threshold is varied.
Fig. 10.1 Comparison of ROC curves: the true positive rate is plotted against the false positive rate
In a ROC curve, the TPR is plotted against the FPR for different cut-off
points. Each point on the resulting plot represents a sensitivity/specificity pair
corresponding to a particular decision threshold. A “perfect” classifier will have a
ROC curve which passes through the upper left corner of the plot, which represents
100% sensitivity and 100% specificity. Therefore the closer the ROC curve is to
the upper left corner, the higher the overall accuracy of the classifier (Zweig and
Campbell 1993), whereas a curve that splits the plot exactly across the diagonal is
equivalent to random guessing. This is illustrated in Fig. 10.1.
The area under the ROC curve, known as the AUC is a scalar value which
captures the accuracy of a classifier. The AUC is a non-parametric measure
representing ROC performance independent of any threshold (Brown and Davis
2006). A perfect ROC will have an AUC of 1, whereas the ROC plot of a random
classifier will result in an AUC of approximately 0.5.
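The threshold sweep and AUC computation described above can be sketched as follows. This is a minimal illustration using a trapezoidal approximation of the area, not the evaluation code used in this work.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Sweep a decision threshold over classifier scores and return
    one (FPR, TPR) pair per threshold."""
    pts = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve (points sorted by FPR)."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A classifier that perfectly separates the classes yields AUC = 1
scores = np.array([0.1, 0.4, 0.6, 0.9])
labels = np.array([0, 0, 1, 1])
pts = roc_points(scores, labels, thresholds=[0.0, 0.5, 1.0])
```

Each point of the returned curve corresponds to one sensitivity/specificity pair, exactly as in Fig. 10.1.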
In this work, we report the TPR, FPR, FPPI and AUC for each of the various
configurations of mammographic image data included in our work-flow.
10.2.2 Mammography
A mammogram is a low-energy X-ray projection of a breast which is performed by
compressing the breast between two plates which are attached to a mammogram
machine: an adjustable plate on top with a fixed X-ray plate underneath. An image
is recorded using either X-ray film or a digital detector located on the bottom plate.
The breast is compressed to prevent it from moving, and to make the layer of breast
tissue thinner.
Two views of each breast are recorded: the craniocaudal (CC) view, which is a top
down view, and the mediolateral oblique (MLO) view, which is a side view taken at
an angle. See Fig. 10.2 for examples of each view. Functional breast tissue is termed
parenchyma and this appears as white areas on a mammogram,while the black areas
are composed of adipose (non-functioning fatty) tissue which is transparent under
Fig. 10.2 Mammograms. On the left is the MLO view, with benign micro-calcifications magnified,
while in the middle is the CC view, with a cancerous mass magnified. Notice the extra information
in the background of the image, such as view labels. On the right is the same CC view divided
into segments; each segment is examined separately for suspicious areas by the method proposed
in this chapter
Mammographic images are examined by radiologists who search for masses
or architectural distortions. A mass is defined in American College of Radiology
(2003) as a space-occupying lesion that can be seen in at least two views.
Architectural distortion is defined as an alteration in the direction of a normal area
of the breast, such that it appears straight, pulled in, wavy or bumpy (Lattanzio
et al. 2010). Mammograms often also contain micro-calcifications, which are tiny
deposits of calcium that show up as bright spots in the images. With the exception of
very dense breasts, where calcifications can be obscured, it is generally accepted that
compared with other abnormalities, micro-calcifications are easier to detect both
visually and by machine due to their bright and distinctive appearance and the fact
that they are intrinsically very different from the surrounding tissue. Also, micro-
calcifications are usually, but not always, benign.
Depending on the machine used for the mammogram, the resulting image is
stored either as a plastic sheet of film or as an electronic image. Many machines
in use today produce digital mammograms. With digital mammograms, the original
images can be magnified and manipulated in different ways on a computer screen.
Several studies have also found that digital mammograms are more accurate in
finding cancers in women under the age of fifty, in peri-menopausal women, and in
women with dense breast tissue (Pisano et al. 2005). Most importantly, the advent of
digital mammography opens up hugeopportunities for the development of computer
aided analysis of mammograms.
10.2.3 Computer-Aided Detection of Mammographic Abnormalities
Various levels of automation exist in mammography, and these can generally be
divided into Computer-Aided Detection (CAD) and Computer-Aided Diagnosis
(CADx) (Sampat and Bovik 2010). In this work we concentrate exclusively on
CAD, in particular, what is known as Stage 1 detection.
In 1967 Winsberg et al. (1967) developed a system for automated analysis of
mammograms. However, it was not until the late 1980s that improved digitisation
methods and increases in computer power made the development of potentially
useful CAD and CADx systems feasible. Since then, a large body of research has
been undertaken on the topic, with many research groups currently active in the area.
A typical work-flow for a computer-aided system is shown in Fig. 10.3. The first
stage of CAD is to detect suspicious regions, which are then examined by more
specialised routines in the second stage. The output of this stage is a set of Regions of
Interest (ROIs) which are passed either to a radiologist or to a CADx system which
outputs the likelihood of malignancy. The involvement of radiologists and/or later
stages obviates the need for a perfectly understandable system, as any diagnostic
action is ultimately determined by them.
As with many medical applications, mammography demands near-perfection,
particularly in the identification of True Positives (TPs), where the true positive
rate (TPR) is measured as the percentage of test cases containing cancerous areas
identified. In general, Stage 1 detectors are quite conservative (Sampat and Bovik
2010), which results in a high number of false positives per image (FPPI), that
is, the number of areas from an image that are incorrectly identified as having
cancerous masses.
While an important function of stage 2 detectors is to reduce the FPPI in the
output produced by the Stage 1 detector, the rate of FPPI can have an impact on
the speed and quality of stage 2 detectors, as a too-conservative approach will
degenerate to returning virtually every image. Although this would return a perfect
TPR, the FPPI rate would render the system virtually useless.
The potential for CAD to improve screening mammography outcomes by
increasing the cancer detection rate has been shown in several retrospective
studies Vyborny (1994), Brake et al. (1998), Nishikawa et al. (1995), and
Fig. 10.3 Atypicalflowchartforcomputeraideddetectionanddiagnosis.Stage1oftheprocess
aims to detect suspicious areas with high sensitivity, while Stage 2 tries to reduce the number of
suspicious lesions without compromising sensitivity
Warren Burhenne and D’Orsi (2002). A more recent study (Cupples et al. 2005)
reported an overall increase of 16% in the cancer detection rates using CAD
together with traditional detection methods. In this study, CAD increased the
detection rate of small invasive cancers (1 cm or less) by 164%. The study
concluded that “increased detection rate, younger age at diagnosis, and significantly
earlier stage of invasive cancer detection are consistent with a positive screening
impact of CAD”.
In general, most automated approaches to mammography divide the images into
segments (Sampat and Bovik 2010) on which further analysis is undertaken. Each
segment is examined for signs indicative of suspicious growths. This work takes a
radically different approach by considering textural asymmetry across the breasts
and between segments of the same breast as a potential indicator for suspicious
areas. This is a reasonable approach because, although breasts are generally
physically (in terms of size) asymmetrical, their parenchymal patterns (i.e., their
mammographic appearance) and, importantly, the texture of their mammograms,
are typically relatively uniform (Tot et al. 2000).
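A minimal sketch of this asymmetry idea follows. The per-segment texture statistics here are simple first-order stand-ins invented for illustration; the chapter's actual textural features are described later.

```python
import numpy as np

def texture_stats(segment):
    """Simple first-order texture statistics for one breast segment
    (hypothetical stand-ins for richer textural features)."""
    return np.array([segment.mean(), segment.std()])

def asymmetry_score(left_segments, right_segments):
    """Mean Euclidean distance between the texture statistics of
    corresponding left/right segments: a higher value suggests
    greater textural asymmetry between the two breasts."""
    dists = [np.linalg.norm(texture_stats(a) - texture_stats(b))
             for a, b in zip(left_segments, right_segments)]
    return float(np.mean(dists))
```

Since healthy parenchymal patterns are typically uniform across both breasts, a large score for some segment pair could flag that pair as worth closer inspection.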
Density of breast tissue is an important attribute of the parenchyma and it has
been established that mammograms of dense breasts are more challenging for
human experts. At the same time, repeated studies have demonstrated that women
with dense tissue in greater than 75 % of the breast are 4–6 times more likely to
develop breast cancer compared with women with little to no breast density (Boyd
et al. 1995,1998; Byrne et al. 2001; McCormack and Santos Silva 2006). Douglas
et al. (2008) highlighted a correlation between genetic breast tissue density and
other known genetic risk factors for breast cancer, and concluded that the “shared
architecture” of these features should be studied further. Given the importance
of parenchymal density as a risk factor and the difficulty for human experts in
identifying suspicious areas in this challenging environment, we believe that a
stage 1 detector which focuses on textural asymmetry may have a strong decision
support role to play in the identification of suspicious mammograms.
10.2.4 Feature Detection, Selection and Extraction
Feature detection, feature selection and feature extraction are crucial aspects of any
image analysis or classification task. This importance is reflected in the volume of
research that has been undertaken on the subject. Feature detection involves the
extraction of possibly interesting features from image data, with a view to using
them as a starting point to guide some detection or classification task. The objective
of feature selection is to extract, from a potentially large set of detected features,
those features that are most useful, in terms of discrimination, for the particular
purpose, and also to determine which combinations of features may work best.
Finally, feature extraction is the process of extracting from detected features the
non-redundant meaningful information that will inform a higher level task such
as classification. This may involve reducing the number of features or combining
features (or aspects thereof) to form new, more compact or useful features.
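As an illustration of feature selection in this sense, the sketch below ranks features by a simple class-separation score. This scoring rule is a hypothetical stand-in chosen for brevity, not the salience analysis used in this work.

```python
import numpy as np

def rank_features(X, y):
    """Rank features by a simple class-separation score: the absolute
    difference of class means divided by the overall feature std.
    X is (n_samples, n_features); y holds 0/1 class labels.
    Returns feature indices, most discriminative first."""
    pos, neg = X[y == 1], X[y == 0]
    score = (np.abs(pos.mean(axis=0) - neg.mean(axis=0))
             / (X.std(axis=0) + 1e-12))   # epsilon avoids division by zero
    return np.argsort(score)[::-1]

# Feature 0 separates the classes; feature 1 is near-constant noise
X = np.array([[0.0, 5.0], [0.1, 4.9], [1.0, 5.0], [0.9, 5.1]])
y = np.array([0, 0, 1, 1])
order = rank_features(X, y)   # feature 0 ranked first
```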
Mammograms are large (the images in this work are of the order of 3600 × 5600
pixels) grey-scale images, but only minute parts of them contain diagnostically
relevant information. Therefore, a detection process typically relies on the existence
of features, which describe various properties of the image. Features are typically
extracted using either area- or pixel-based measures. In this work we focus
exclusively on area-based features as they are better suited to the identification of
ROIs (because the images are so large) than their pixel-based counterparts, which
are best suited for highly localized search. Section 10.3 below describes the features
used in this work.
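The kind of segmentation and area-based feature computation described here can be sketched as follows. The grid size and the particular per-segment statistics are illustrative assumptions, not the chapter's exact configuration.

```python
import numpy as np

def grid_segments(image, rows, cols):
    """Split a 2-D mammogram array into a rows x cols grid of
    sub-images (any remainder pixels at the edges are dropped)."""
    h = image.shape[0] // rows
    w = image.shape[1] // cols
    return [image[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(rows) for c in range(cols)]

def area_features(segment):
    """Area-based (per-segment) features: one small feature vector
    per segment, rather than one value per pixel."""
    return [float(segment.mean()), float(segment.std()),
            float(np.median(segment))]

# Example: a toy 4x4 "mammogram" split into four 2x2 segments
img = np.arange(16, dtype=float).reshape(4, 4)
segments = grid_segments(img, rows=2, cols=2)
vectors = [area_features(s) for s in segments]
```

Each segment's vector is what would ultimately be fed, after selection, into the GP classifier.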
10.2.5 Related Work
Although several CAD systems already exist, most are Stage 2 detectors (Sampat
and Bovik 2010) and focus on particular kinds of masses, e.g. spiculated lesions. Of
the more general systems, the best reported appears to be that of Ryan et al. (2014)
which reports a best TPR of 100% with an FPPI of just 1.5. Other good results
were produced by Li et al. (2001) with 97.3% TPR and 14.81 FPPI. Similar work
by Polakowski et al. (1997) had a lower TPR (92%) but with a much lower FPPI
rate (8.39).
There has been a great deal of research undertaken in the area of detection
and classification of micro-calcifications. Various approaches to feature detection
have been proposed including texture features, gray level features (Dhawan et al.
1996), wavelet transforms (Strickland and Hahn 1996), identification of linear
structures (Wu et al. 2008) and various statistical methods. In 2004, Soltanian-
Zadeh et al. (2004) undertook a comparison of the most popular features used
for micro-calcification detection including texture, shape and wavelet features.
They concluded that the multi-wavelet approach was superior for the particular
purpose. In more recent work using Cartesian GP, micro-calcifications were targeted
by Volk et al. (2009), in a CADx application, where they took 128 × 128 pixel
segments, each of which contained at least one micro-calcification, and predicted
the probability of it being malignant.
For the objectives of mass segmentation and detection, image features which
capture aspects of shape (Rangayyan et al. 1997), edge-sharpness (Mudigonda et al.
2000) and texture (Bovis and Singh 2000) are frequently used. Nandi et al. (2006)
reported a classification accuracy of 98 % on test data when using a combination of
all three of these feature types. In that work, the researchers examined a database
of 57 images, each of which already had 22 features detected, and used GP in
combination with various feature selection methods to reduce the dimensionality
of the problem.
Varying numbers of Haralick texture features were used in Woods (2008) to train
a NN classifier to detect cancerous lesions in contrast-enhanced magnetic resonance
imaging (DCE-MRI) for both breast and prostate cancer. The results of that study
showed that the proposed approach produced classifiers which were competitive with
a human radiologist.
Learning Paradigms
Various research paradigms such as neural networks (Papadopoulos et al. 2005),
fuzzy logic (Cheng et al. 2004) and Support Vector Machines (SVM) (Dehghan
et al. 2008; Cho et al. 2008) have been applied to the problem. In a review
of various ML approaches for detecting micro-calcifications, Sakka et al. (2006)
concluded that neural networks showed the most promise of the methods studied. In
a contemporary review, Alanís-Reyes et al. (2012) employed feature selection using
a GA and then compared the classification performance of various ML algorithms
in classifying both micro-calcifications and other suspicious masses, using these
features. Their results showed that SVM produced the best overall performance.
Given the success of GP in finding solutions to a wide range of problems,
it is not surprising that the approach has been applied to problems relating to
mammography. Quite a lot of the GP research effort has successfully demonstrated
feature selection and classification of micro-calcifications and masses as either
benign or malignant (Zheng et al. 1999; Nandi et al. 2006; Verma and Zhang 2007;
Sánchez-Ferrero and Arribas 2007; Hernández-Cisneros et al. 2007). In this work
the feature detection task is generally not handled by the genetic programs.
A genetic algorithm was used for feature selection in Sahiner et al. (1996), where
a very large number of initial features were reduced to a smaller set of discriminative
ones and then passed to either a NN or a linear classifier.
Other notable research using GP is that of Ahmad et al. (2012), who designed
a Stage 2 cancer detector for the well known Wisconsin Breast Cancer dataset,
in which they used the features extracted from a series of fine needle aspira-
tions (FNAs) and an evolved neural network. Ludwig and Roos (2010) used GP
to estimate the prognosis of breast cancer patients from the same data set, initially
using GP to reduce the number of features, before evolving predictors. Langdon
and Harrison (2008) took a different approach, using biopsy gene chip data, but
their system approached a similar level of automation.
Current work in mammography has been concerned with a combination of
feature selection and classification (Ganesan et al. 2013). One such approach
suggested by Ryan et al. (2014) reports a best TPR of 100 % with an FPPI of only
1.5. In other work, the best reported appears to be that of Li et al. (2001), which
delivers a 97.3 % TPR with 14.81 FPPI. Similar work by Polakowski et al. (1997)
reported a lower TPR (92 %) but with a much lower FPPI rate (8.39). The standard
method of reporting results is the TP/FPPI breakdown, which is what we will also
present here.
See Petrick et al. (2013) for an evaluation of the current state-of-the-art of
computer-aided detection and diagnosis systems. In other work, Worzel et al. (2009)
reported favourably on the application of GP in cancer research generally.
256 C. Ryan et al.
Most systems operate only at the Classification stage, although more recent work
also considers Feature Selection. As we generate our own features, we can modify
and parameterize them based on the analysis of our classifiers. While the focus of
this chapter is on the classification system, because we extract the features from
the images ourselves, GP will eventually form part of a feedback loop, instructing
the system about what sorts of features are required. See Sect. 10.6 for more details
on this.
Most previous work relies upon previously extracted features, and all the
previous work mentioned above deals with a single breast in isolation (although
using segmentation and multiple views). Our work leverages the research by Tot
et al. (2000) which indicates that, in general, both breasts from the same patient
have the same textural characteristics. Our hypothesis is that breasts of the same
patient that differ texturally may contain suspicious areas.
In summary, the unique features of our approach are that we do not focus only on
a single breast, but address the problem by considering textural asymmetry both
across the breasts and between segments of the same breast; and we do not confine
our efforts simply to the classification step. Rather, we adopt an end-to-end strategy
which focuses on area-based features and incorporates feature extraction, detection
and selection.
10.3 Workflow
Part of the challenge in a project like this is to choose how to represent the
data. A typical mammogram used in this study is 3575 × 5532 pixels and 16-bit
gray-scale, which is a challenging volume of data to process. The following
work-flow was created. Steps 1–5 are concerned with the raw images, while steps
6 and 7 deal with model development. The work-flow can also serve as a
template for similar tasks, where steps 1 and 2 can be replaced with domain-specific
processing:
1. Background suppression
2. Image segmentation
3. Feature detection
4. Feature selection
5. Dataset construction
6. Model development
7. Model testing
10.3.1 Background Suppression
Figure 10.2 shows that much of each image consists of background and, clearly,
this must first be removed before calculating segments and extracting features.
Removing the background is a non-trivial task, partly because of the non-uniformity
of breast size across patients, but also because of the difficulty in taking consistent
mammograms. Due to the pliable nature of the breasts and the way in which the
mammograms are photographed (by squeezing the breast between two plates), the
same breast photographed more than once on the same machine (after a reset) may
look different.
The background of the mammographic image is never perfectly homogeneous,
and it includes at least one tag letter indicating whether the image is of a right or
left breast. This is sometimes augmented by a string of characters indicating which
view (CC or MLO) is depicted. It is necessary to remove this background detail and
replace it with a homogeneous one so that the image can be properly processed at a
later stage.
Our first attempt was based on the Canny Edge Detector but, although this
method is effective on raw imagery, the mammograms we dealt with had been
processed to increase the contrast within the breast (to make them easier to read,
but which has the side effect of reducing the contrast between the edge of the breast
and the background). Canny Edge Detection proved to be less effective on these
images.
Our most efficient technique was to use a threshold (the average of the median
pixel value and the mean pixel value) such that any pixel (px, py) above the
threshold level was kept. We used local thresholding, with the threshold defined as
the average of the mean and median calculated from each pixel's circular
neighbourhood of radius 20, i.e. the pixels (x, y) satisfying
20² ≥ (x − px)² + (y − py)². We scan each horizontal line right to left. Once three
consecutive pixels are brighter than the threshold calculated in the above way, those
pixels and the pixels to the left of them are considered as belonging to the breast.
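As an illustration of this thresholding scheme, the following sketch (in Python with NumPy; the function names and the brute-force neighbourhood loop are ours, not taken from the original implementation) computes a per-pixel threshold as the average of the mean and median over a circular neighbourhood, then performs the right-to-left scan:

```python
import numpy as np

def local_threshold(img, r=20):
    """Per-pixel threshold: average of the mean and the median of the pixel's
    circular neighbourhood of radius r (brute force; fine for small images)."""
    h, w = img.shape
    thr = np.empty((h, w))
    ys, xs = np.ogrid[-r:r + 1, -r:r + 1]
    disk = ys**2 + xs**2 <= r**2          # circular neighbourhood mask
    for py in range(h):
        for px in range(w):
            # clip the disk window at the image border
            y0, y1 = max(0, py - r), min(h, py + r + 1)
            x0, x1 = max(0, px - r), min(w, px + r + 1)
            win = img[y0:y1, x0:x1]
            m = disk[y0 - py + r:y1 - py + r, x0 - px + r:x1 - px + r]
            vals = win[m]
            thr[py, px] = (vals.mean() + np.median(vals)) / 2.0
    return thr

def breast_mask(img, thr, run=3):
    """Scan each row right to left; once `run` consecutive pixels exceed the
    local threshold, mark them and everything to their left as breast."""
    h, w = img.shape
    mask = np.zeros((h, w), bool)
    for y in range(h):
        consec = 0
        for x in range(w - 1, -1, -1):
            consec = consec + 1 if img[y, x] > thr[y, x] else 0
            if consec >= run:
                mask[y, :x + run] = True
                break
    return mask
```

On full-size mammograms a production version would replace the per-pixel loop with integral-image or rank-filter tricks, but the logic is the same.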
10.3.2 Image Segmentation
Our approach is to divide each image into three segments, and to examine each
segment separately. As there can be more than one suspicious area in an image,
we return true for as many segments as the system finds suspicious, meaning that
a single mammogram can have several positives returned. With Stage 1 detectors
such as ours, this is described by the FPPI of an image, as discussed in Sect. 10.2.5.
Of course, the maximum FPPI is capped by the number of segments that the
breast is divided into. Using fewer segments means that the FPPI will be lower, but
at the cost of making the detection of the TPs substantially more difficult, because
each segment then covers a larger area.
Using the same algorithm outlined in Ryan et al. (2014), we segmented the
breast images into three overlapping sub-images of roughly similar size, as shown
in Fig. 10.2. The first of these captures the nipple area and the other two cover the
top and bottom sections of the rest of the breast. The three segments intersect, to
help reduce the possibility of a mass going unnoticed.
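The exact segmentation geometry of Ryan et al. (2014) is not reproduced here, but the idea of three overlapping sub-images of roughly similar size can be sketched as follows (illustrative proportions only; the bounding-box convention, the nipple-side orientation and the overlap fraction are our assumptions):

```python
def three_segments(x0, y0, x1, y1, overlap=0.1):
    """Split a breast bounding box (x0, y0)-(x1, y1) into three overlapping
    rectangles: a nipple region plus top and bottom regions of the rest.
    Purely illustrative geometry; assumes the nipple faces the right edge."""
    w, h = x1 - x0, y1 - y0
    ov_x, ov_y = int(w * overlap), int(h * overlap)
    # nipple area: outer (rightmost) third of the box, widened by the overlap
    nipple = (x0 + 2 * w // 3 - ov_x, y0, x1, y1)
    # top and bottom halves of the remainder, overlapping in the middle
    inner_x1 = x0 + 2 * w // 3 + ov_x
    top = (x0, y0, inner_x1, y0 + h // 2 + ov_y)
    bottom = (x0, y0 + h // 2 - ov_y, inner_x1, y1)
    return nipple, top, bottom
```

The deliberate overlaps mean a mass lying on a boundary still falls wholly inside at least one segment.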
In summary, each patient has two breasts, and mammograms are taken for two
views (CC and MLO) of each breast—giving a total of four mammograms per
patient. From these four mammograms we obtain sixteen images: four images of
the full breast (left CC, left MLO, right CC, right MLO) and three sub images (top,
bottom, nipple) for each of these four images. We construct our training and test
data with features obtained from these sixteen images/sub-images.
10.3.3 Textural Features
As with most image classification systems, before attempting to classify mammograms
as suspicious or not, we must first extract features for GP. In this study, we use
Haralick’s Texture Features (Haralick et al. 1973) as we believe that textural features
are appropriate in this case because we are examining parenchymal patterns, and our
hypothesis is that suspicious areas are likely to be texturally dissimilar to normal
areas. However, different features may be suited for other problem domains, in
which case the detection and selection of these problem specific features can simply
slot into the work-flow at this juncture.
The seminal work of Haralick et al. (1973) described a method of generating
14 measures which can be used to form 28 textural features from a set of co-
occurrence matrices or “grey tone spatial dependency matrices”. When applied to
pixel grey levels, the Grey Level Co-occurrence Matrix (GLCM) is defined to be
the distribution of co-occurring values at a given offset. In other words, the GLCM
records how often pairs of grey levels occur at pixel pairs standing in a given spatial
relationship. That relationship is typically specified by assuming that the second
pixel is at a specific offset with respect to the first one.
Given a neighbourhood relationship r, an element c(i, j) of a GLCM of image m
is the probability that a pixel p and its neighbour pixel q have brightness values i
and j respectively, i.e., Pr(r(p, q) ∧ m(p) = i ∧ m(q) = j).
Using the co-occurrence matrix, different properties of the pixel distribution can
be generated by applying various calculations to the matrix values.
Given an image matrix which uses three grey levels (Table 10.1), a
co-occurrence matrix (Table 10.2) is obtained by moving over the image matrix and
calculating f(i, j), where f(i, j) is the frequency with which grey levels i and j occur
at a given distance and direction.

Table 10.1 Pixel grey levels
Table 10.2 Co-occurrence matrix
For example, f(0, 0) = 8 is obtained by scanning the image matrix and, for
each pixel with a grey value of zero, incrementing f(0, 0) every time one of its
neighbours in the horizontal direction at a distance of 1 also has a value of zero.
Co-occurrence matrices can also be generated in other directions: 90, 135 and 45
degrees (vertical and diagonal), and for distances other than one. In this work
we examine a neighbourhood of one and average the feature values over the four
orientations (Table 10.2).
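The construction just described can be sketched in a few lines (a minimal Python/NumPy implementation of a symmetric GLCM; the function name and the pair-counting loop are ours):

```python
import numpy as np

def glcm(img, dy=0, dx=1, levels=None, symmetric=True):
    """Grey-level co-occurrence matrix: f(i, j) counts how often a pixel with
    level i has a neighbour with level j at offset (dy, dx)."""
    img = np.asarray(img)
    n = levels if levels is not None else img.max() + 1
    out = np.zeros((n, n), dtype=int)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[img[y, x], img[ny, nx]] += 1
    if symmetric:            # count each pair in both directions
        out = out + out.T
    return out

# horizontal (0 degree, distance 1) GLCM of a tiny 2-level image
m = glcm([[0, 0, 1],
          [0, 0, 1]])
# → [[4, 2],
#    [2, 0]]
```

Other directions are obtained simply by changing the (dy, dx) offset, e.g. (1, 0) for 90° or (1, 1) for one of the diagonals.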
Haralick et al. (1973) showed that GLCMs conveniently lend themselves to
efficient calculation of various informative measures, including:
1. Angular Second Moment
2. Contrast
3. Correlation
4. Sum of squares
5. Inverse Difference Moment
6. Sum Average
7. Sum Variance
8. Sum Entropy
9. Entropy
10. Difference Variance
11. Difference Entropy
12. Information Measure of Correlation 1
13. Information Measure of Correlation 2
14. Maximal Correlation Coefficient
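Several of the listed measures are simple sums over the matrix once it has been normalised to probabilities p(i, j); the following sketch covers three of the fourteen (the remaining measures follow the same pattern):

```python
import numpy as np

def haralick_subset(glcm):
    """Compute Angular Second Moment, Contrast and Entropy from a
    co-occurrence matrix (sketch; input is normalised to probabilities)."""
    p = glcm / glcm.sum()
    i, j = np.indices(p.shape)
    asm = (p ** 2).sum()                     # 1. Angular Second Moment
    contrast = (p * (i - j) ** 2).sum()      # 2. Contrast
    nz = p[p > 0]                            # skip empty cells for the log
    entropy = -(nz * np.log2(nz)).sum()      # 9. Entropy
    return asm, contrast, entropy

asm, contrast, entropy = haralick_subset(np.array([[4, 2], [2, 0]]))
# asm = 0.375, contrast = 0.5, entropy = 1.5
```

Note how Contrast weights co-occurrences by the squared grey-level difference, so it is large for textures with abrupt local changes, while ASM is large for homogeneous textures.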
For a chosen distance there are four spatial dependency matrices corresponding
to the four directions 0°, 45°, 90° and 135°, giving four values for each of the 14
Haralick texture measures listed. There are some issues with this approach. The
amount of data in the co-occurrence matrices varies with the range and number of
values chosen for neighbourhood and direction, and can be significantly higher than
the amount of data in the original image. Simple examples of the method found in
the literature typically use few grey levels for ease of explanation. However, in real-
life applications the number of grey levels is likely to be significant. This greatly
increases the volume of matrix data: there will be an n × n matrix for each
direction and each distance chosen, where n is the number of grey levels. Also, the
resulting matrices are often very sparse, as certain combinations of brightness may
never occur in an image. In spite of these obvious downsides, Haralick features are
widely used in the research literature.
To quantitatively describe the textural characteristics of breast tissue, we
calculate a GLCM for each segment and for each breast. To keep the GLCM
size manageable, we first reduce the number of gray levels to 256 (from 65,535 in
the original images) via linear scaling. Because textures in mammograms are often
anisotropic (directionally dependent), we independently calculate GLCMs for four
orientations corresponding to two adjacent and two diagonal neighbours. Next, we
calculate 13 Haralick features (Haralick et al. 1973); we exclude the 14th feature,
the Maximal Correlation Coefficient, as it can be computationally unstable (Woods
2008). By doing this for each orientation, we obtain 52 features per segment,
which may subsequently be passed to a ML system for classification. This down-
sampling of gray levels, construction of GLCMs and extraction of Haralick features
is achieved using MATLAB (2013).
Segments are rectangular and often extend beyond the breast, which means that
they may contain some background. A GLCM calculated from such a segment in the
normal way would register very high values for black pixels (m(p) = 0 or m(q) = 0),
which may distort the values of the Haralick features. As many mammographic
images contain useful information captured in black pixels, such as sections of
adipose tissue (fat), which appears black in mammograms, it would not be correct
to simply ignore black pixels. Therefore, before calculating the GLCM, we increase
by one the intensity of every pixel within the breast, using the information resulting
from the segmentation stage (see previous subsection). The pixels that already had
the maximal value retain it (this causes a certain information loss, albeit a negligible
one, as there are typically very few such pixels). Then, once the GLCM has been
calculated, we simply “hoist” the GLCM up and to the left to remove the impact of
the unmodified background pixels.
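This shift-and-hoist procedure can be sketched as follows (function names and the minimal horizontal-GLCM helper are ours; any GLCM routine could be substituted):

```python
import numpy as np

def hglcm(img, levels):
    """Minimal symmetric horizontal GLCM (distance 1), for illustration."""
    g = np.zeros((levels, levels), dtype=int)
    for row in img:
        for a, b in zip(row[:-1], row[1:]):
            g[a, b] += 1
            g[b, a] += 1
    return g

def glcm_ignoring_background(img, breast_mask, levels=256):
    """Shift breast pixels up by one grey level so that level 0 is background
    only, compute the GLCM, then 'hoist' it up and to the left to drop the
    row and column contributed by unmodified background pixels."""
    shifted = img.copy()
    # breast pixels gain 1; pixels already at the top level retain it
    shifted[breast_mask] = np.minimum(shifted[breast_mask] + 1, levels - 1)
    g = hglcm(shifted, levels)
    return g[1:, 1:]   # discard co-occurrences involving background (level 0)
```

After the hoist, genuinely black breast tissue (now at level 1) still contributes, while the homogeneous background does not.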
Feature Selection As previously mentioned, the neighbourhood relation of the
GLCM can be varied, such that the calculation is conducted on pixels further away
from each other, but each extra neighbourhood examined produces another 52
features per segment. In this work we examine neighbourhoods composed of
direct neighbours only (i.e., at a distance of 1 from the reference pixel) and average
the feature values over the four orientations.
We conducted a preliminary analysis of the 13 computed Haralick features,
where we initially examined variance across and between both classes and then
carried out a more formal analysis using several ranker methods (Hall et al. 2009)
which ranked the attributes according to the concept of information gain. In this
context, information gain can be thought of as a measure of the value of an attribute,
describing how well that attribute separates the training examples according
to their target class labels. Information gain is also known as Kullback-Leibler
divergence (Kullback and Leibler 1951), information divergence or relative entropy,
and employs the idea of entropy as used in information theory. These
feature selection steps suggested that the most promising features in terms of
discrimination were contrast and difference entropy. Accordingly, we discarded the
other features and let GP focus on those two.
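The information-gain idea behind such rankers can be sketched with a single-split version (real rankers, such as those in the toolkit of Hall et al. (2009), discretize attributes more carefully; the split threshold here is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels, threshold):
    """Information gain of one real-valued feature split at `threshold`:
    entropy of the labels minus the split-weighted remaining entropy."""
    left = [l for v, l in zip(feature_values, labels) if v <= threshold]
    right = [l for v, l in zip(feature_values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in (left, right) if p)
    return entropy(labels) - remainder
```

A feature that separates the classes perfectly at some threshold attains the full class entropy (1 bit for a balanced binary problem), which is why contrast and difference entropy, scoring highest, were retained.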
10.4 Experimental Setup
In this section, we describe the construction and distributions of the datasets used,
together with details of configurations of that data for specific experiments. We also
provide details of the GP parameters and classification algorithm employed.
10.4.1 Dataset Construction
This work employs the University of South Florida Digital Database for Screening
Mammography (DDSM) (Heath et al. 2001), which is a collection of 43 “volumes”
of mammogram cases. A volume is a collection of mammogram cases (typically
about 50–100 different patients) and can be classified as either normal, positive,
benign or benign without callback. All patients in a particular volume have the same
classification. We use cases from the cancer02 volume and three of the normal
volumes (volumes 1–3). For this study we do not use images from either the benign
or benign without callback volumes.
The incidence of positives within mammograms is roughly 5 in 1000,¹ giving a
massively imbalanced data set. To ensure that our training data maintains a more
realistic balance, we deliberately select only a single volume of positive cases.
Several images were discarded either because of image processing errors or
because we were unable to confidently identify which segment/s were cancerous
for a particular positive case. In the current work, this latter task was performed
manually. We will automate this step in the next iteration. This initial processing
resulted in a total of 294 usable cases, 75 of which contain cancerous growths (which
we call positive in the remainder of this document). Each case initially consists of
images for the left and right breasts and for the MLO and CC views of each breast.
Once the segmentation step has been completed images are added for each of the
three segments (nipple/top/bottom) for each view of each breast. Thus, there are a
total of four images per view for each breast: one for the entire breast (A), and one
for each of the three segments (At;Ab;An).
If we count the numbers of positives and negatives in terms of breasts rather than
cases, which is reasonable given that each is examined independently (i.e. most, but
not all, patients with cancerous growths have them in only one breast), then the
number of non-cancerous images increases significantly: two for each non-cancerous
case and one for most cancerous cases. For the volumes studied, of the
75 usable positive cases, 3 have cancer in both breasts. Thus, considering full breast
CC images only, we have 78 positive images and 510 (219 × 2 + 72) negative ones.
Turning our attention to segments (At;Ab;An) (excluding full breast images), and
again considering only CC segments for the moment, for each non-cancerous case
¹ The actual incidence over a patient's lifetime is closer to 1 in 7 (Kerlikowske et al. 1993).
we have 3 segments for each breast (left and right) together with 2 non-cancerous
segments for each cancerous breast which gives a total of 1686 non-cancerous
segments and 78 cancerous segments. Similarly, for the MLO view there are 1686
non-cancerous segments and 78 cancerous ones.
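These counts can be checked arithmetically (assuming, as the text implies, exactly one cancerous segment per positive breast):

```python
# Worked check of the image counts quoted above, for one view (CC).
cases = 294
positive_cases = 75
negative_cases = cases - positive_cases            # 219
bilateral = 3                                      # cancer in both breasts

positive_breasts = positive_cases + bilateral      # 78
# two breasts per negative case, plus the unaffected breast of each
# unilateral positive case (75 - 3 = 72)
negative_breasts = negative_cases * 2 + (positive_cases - bilateral)   # 510

segments_per_breast = 3
positive_segments = positive_breasts               # one cancerous segment each
negative_segments = ((positive_breasts + negative_breasts)
                     * segments_per_breast) - positive_segments        # 1686
```

The same arithmetic applied to the MLO view yields the matching 78/1686 split, and combining both views doubles each figure to 156 and 3372.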
Thus, we obtain three different distributions: one for the non-segmented single
view (CC or MLO) full breast images (78 positives (P) and 510 negatives (N)),
one for the segmented single views (78 Ps and 1686 Ns) and one for segmented
combined CC and MLO views (156 Ps and 3372 Ns). Each of these three distributions
exhibits very significant class imbalance which, in and of itself, increases the level of
difficulty of the classification problem. The imbalance in the data was handled in all
cases by using Proportional Individualised Random Sampling (Fitzgerald and Ryan
2013), as described in Sect. 10.4.2.
Based on this master dataset, we consider several setups representing different
configurations of breasts, segments and views (see Table 10.3). The following
terminology is used to describe the composition of instances for a given setup,
where an instance is a single training or test example in a dataset: BXSYVZ, where
X is the number of breasts, Y the number of segments and Z the number of views
for a given instance. In the cases where there is just one view (B1S1V1, B2S2V1,
Table 10.3 Experimental configurations

Name      Ps   Ns    Description
B1S0V1    78   510   1 breast, unsegmented image, 1 view; uses CC view only
B1S1V1    78   1686  1 breast, 1 seg., 1 view; uses CC view only
B1S2V2    156  3372  1 breast, 2 segs., 2 views; uses both CC and MLO views
B1S3V1    78   1686  1 breast, 3 segs., 1 view; CC view only
B2S0V1    78   510   2 breasts, unsegmented images, 1 view; CC views only
B2S2V1    78   1686  2 breasts, 2 segs., 1 view; both CC views, one segment from each breast
B2S4+0V1  78   1686  2 breasts, 4 segs., 1 view; CC views, 1 segment + unsegmented image from each breast
B2S3+0V1  78   1686  2 breasts, 3 segs., 1 view; CC views, 1 segment from each breast + unsegmented image from the first
B2S4V1    78   1686  2 breasts, 4 segs., 1 view; CC views, three segments from one breast + one from the other
B2S6V1    78   1686  2 breasts, 6 segs., 1 view; CC views, three segments from one breast + one segment from the other + 2 unsegmented images

Each was generated from the same master data set
B1S3V1, B2S4V1) we use the CC views, while in the cases where the breast has
been segmented, the system attempts to classify whether or not the segment has a
suspicious area. Of particular interest are the two-breast (B2SYV1) setups, which
investigate the use of asymmetry. These rely solely on the CC view: each instance
is comprised of selected features from one breast's CC segment/s together with the
same features taken from the corresponding CC segment/s of the other breast, for
the same patient.
When it comes to processing the data, we want to exploit any differences between
a segment and the rest of the breast (i.e. between A and Ax), but also between a
segment and the corresponding segment from the opposite breast (say B and Bx),
with the objective of evolving a classifier capable of pinpointing a specific cancerous
segment. For each particular setup, features from the segment of interest are the first
occurring data items in each instance of the dataset for that setup, where the segment
of interest is the segment for which we want to obtain a prediction. Details of the
specific setups used in the current study are as follows:

B1S0V1
This dataset configuration has an instance for the selected features of each full breast
image. It uses the CC view only and has an instance for each breast for each patient.
It has 78 Ps and 510 Ns.

B1S1V1
The B1S1V1 configuration also uses only the CC view, but this setup uses each
of the three segments (At, Ab, An) separately, i.e. each instance is comprised of the
feature values for a single segment. Again there is an instance for each breast for
each segment. This results in 78 Ps and 1686 Ns.

B1S2V2
Both views are used in the B1S2V2 setup. For each segment, excluding the full
breast image, for each breast, each instance contains feature values for that segment
and the corresponding segment for the other view (CC or MLO), i.e. each instance
has information for both views of a single breast. So for a given segment, say At,
there are instances for the following: (At LEFT_CC, At LEFT_MLO), (At
LEFT_MLO, At LEFT_CC), (At RIGHT_CC, At RIGHT_MLO) and
(At RIGHT_MLO, At RIGHT_CC).
In this setup the segments of interest are At LEFT_CC, At LEFT_MLO,
At RIGHT_CC and At RIGHT_MLO respectively, i.e. the segment whose features
occur first in each instance. The same pattern applies when another segment is used.

B1S3V1
This configuration uses the three CC segments (At, Ab, An) for a single breast, where
the first segment is alternated in successive instances. For example, for a given single
breast there are three training instances, similar to (At, Ab, An), (Ab, At, An) and
(An, At, Ab), where the order of the remaining two segments does not matter.

B2S2V1
In this configuration we investigate the simplest case of symmetry: each entry
consists of the feature values for a single CC segment from one breast combined
with those of the corresponding CC segment from the other breast, for the same
patient. In this case there are two entries for each segment: (Ax LEFT_CC,
Ax RIGHT_CC) and (Ax RIGHT_CC, Ax LEFT_CC), where x represents a particular
segment (At, Ab, An).

B2S3+0V1
There are two set-ups which deviate slightly from the naming scheme above,
namely, B2S3+0V1 and B2S4+0V1. Here, +0 indicates that features for a non-
segmented image have been included. Each instance in this setup is comprised of
feature data from segmented and unsegmented images. It consists of information for
a segment, the unsegmented image and the corresponding segment from the other
breast, for example (At LEFT_CC, A LEFT_CC, At RIGHT_CC).
B2S4+0V1
Similar to B2S3+0V1, each instance in this setup is again comprised of feature data
from segmented and unsegmented images. It consists of information for a segment,
the unsegmented image and the corresponding segment from the other breast
together with the unsegmented image data for the other breast, for example
(At LEFT_CC, A LEFT_CC, At RIGHT_CC, A RIGHT_CC).

B2S4V1
The B2S4V1 experimental setup is a combination of B1S3V1 and B2S2V1, where
each training instance is comprised of the feature values for the three segments for
a single breast (A) combined with the corresponding segment from the other breast,
for example (At LEFT_CC, Ab LEFT_CC, An LEFT_CC, At RIGHT_CC), where
in this instance At LEFT_CC is the segment of interest.

B2S6V1
The final experimental setup is an extension of B2S4V1 where feature values for
the full breast segment for the right and left breasts are added, for example
(At LEFT_CC, Ab LEFT_CC, An LEFT_CC, At RIGHT_CC, A LEFT_CC,
A RIGHT_CC), where in this instance At LEFT_CC is the segment of interest.
It is important to note that where more than one segment is used, the segment
of interest is the first occurring (leftmost) one, for example At LEFT_CC in
the B2S6V1 setup example above. If that segment is diagnosed as cancerous then
the training/test instance in which it occurs is marked as positive, and if it is not
diagnosed as cancerous then the entire instance is marked as negative, regardless
of the cancer status of any other segments used in that particular instance. Thus,
excluding the B1S0V1 setup, the objective is not simply to determine if a given
breast is positive for cancer, but rather to pinpoint which segments are positive.
If successful, this capability could pave the way for further diagnosis.
10.4.2 Proportional Individualised Random Sampling
In each of our experimental configurations there is significant disparity in the
number of positive to negative instances. Greater disparity makes classification problems
much more challenging, as there is an inherent bias towards the class which has
greater representation in the dataset—in this case, the negative class. When a
ML algorithm, designed for general classification tasks and scored according to
classification accuracy, is faced with significant imbalance, the “intelligent” thing
for it to do is to predict that all instances belong to the majority class. Ironically,
it is frequently the case that the minority class is the one which contains the
most important or interesting information. In datasets from the medical domain,
such as our mammographic data it is often the case that instances which represent
malignancy or disease are far fewer than those which do not.
Various approaches to mitigating class imbalance problems have been proposed
in the literature. In general, methods can be divided into those which tackle the
imbalance at the data level, and those which propose an algorithmic solution. In
addition, several hybrid approaches have been advanced which combine aspects of
the other two.
Methods which operate on the data level attempt to repair the imbalance by
rebalancing training data. This is usually achieved by either under-sampling the
majority class or over-sampling the minority class, where the former involves
removing some examples of the majority class and the latter is achieved by adding
duplicate copies of minority instances until such time as some predefined measure
of balance is achieved. Over- or under-sampling may be random in nature (Batista
et al. 2004) or “informed” (Kubat et al. 1997), where in the latter various criteria are
used to determine which instances from the majority class should be discarded. An
interesting approach called SMOTE (Synthetic Minority Oversampling Technique)
was suggested by Chawla et al. (2002), in which, rather than over-sampling the
minority class with replacement, they generated new synthetic examples.
At the algorithmic level, Joshi et al. (2001) modified the well-known AdaBoost
(Freund and Schapire 1996) algorithm so that different weights were applied for
boosting instances of each class. Akbani et al. (2004) modified the kernel function
in a Support Vector Machine implementation to use an adjusted decision threshold.
Class imbalance tasks are closely related to cost based learning problems, where
misclassification costs are not the same for both classes. Adacost (Fan et al. 1999)
and MetaCost (Domingos 1999) are examples of this approach. See Kotsiantis et al.
(2006) and He and Garcia (2009) for a thorough overview of these and various other
methods described in the literature.
There are several disadvantages to the application of over- or under-sampling
strategies. The obvious downside to under-sampling is that it discards potentially
useful information. The main drawback with standard over-sampling is that exact
copies of minority instances are introduced into the learning system, which may
increase the potential for over-fitting. Also, the use of over-sampling generally
results in increased computational cost because of the increased size of the dataset.
In this study, we have employed a proportional sampling approach (Fitzgerald and
Ryan 2013) which eliminates or mitigates these disadvantages.
Using this approach the size of the dataset remains unchanged, so there is none
of the extra computational cost generally associated with random over-sampling.
Instead, the number of instances of each class is varied. At each generation,
and for each individual in the population, the percentage of majority instances
is randomly selected in the range between the percentages of minority (positive)
and majority (negative) instances in the original distribution. Then, that particular
individual is evaluated on that percentage of majority instances, with instances of the
minority class making up the remainder of the data. In both cases, each instance is
randomly selected with replacement. In this way, individuals within the population
are evaluated with different distributions of the data within the range of the original
distribution.
The benefit of the method from the under-sampling perspective is that, while
the majority class may not be fully represented at the level of the individual, all
of the data for that class is available to the population as a whole. Because all of
the available knowledge is spread across the population, the system is less likely to
suffer from the loss of useful data that is normally associated with under-sampling
techniques. From the over-sampling viewpoint, over-fitting may be less likely, as
the distribution of instances of each class is varied for each individual at every
generation. Also, as all sampling is done with replacement, there may be duplicates
of negative as well as positive instances.
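The sampling scheme can be sketched as follows (the function name is ours, and the original operates on instance indices inside the GP evaluation loop rather than on raw instance lists):

```python
import random

def pirs_sample(positives, negatives, rng=random):
    """Proportional Individualised Random Sampling (sketch): for one
    individual's evaluation, draw a dataset of the original size in which
    the majority-class share is picked uniformly between the original
    minority and majority shares; all draws are with replacement."""
    n = len(positives) + len(negatives)
    min_share = len(positives) / n
    maj_share = rng.uniform(min_share, len(negatives) / n)
    n_neg = round(n * maj_share)
    sample = ([rng.choice(negatives) for _ in range(n_neg)]
              + [rng.choice(positives) for _ in range(n - n_neg)])
    return sample
```

Because `maj_share` is redrawn per individual per generation, two individuals in the same generation generally see different class balances of the same underlying data.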
Previous work (Liu and Khoshgoftaar 2004) has shown that, aside from the
consideration of balance in the distribution of instances, the use of random sampling
techniques may have a beneficial effect in reducing over-fitting.
10.4.3 GP Methodology and Parameters
All experiments used a population of 200 individuals, running for 60 generations,
with a crossover rate of 0.8 and a mutation rate of 0.2. The minimum initial depth was
four, while the maximum depth was 17. The instruction set was small, consisting of
just +, −, ×, /. The tree terminals (leaves) fetch the Haralick features as defined in
Sect. 10.3.3, with two available per segment.
To transform the continuous output of a GP tree into a nominal decision (Positive,
Negative), we binarize it using the method described in Fitzgerald and Ryan (2012),
which optimizes the binarization threshold individually for each GP classifier.
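The binarization step can be illustrated with a simple threshold search (the selection criterion below, TPR minus FPR, is an illustrative stand-in; Fitzgerald and Ryan (2012) describe the method actually used):

```python
def best_threshold(scores, labels):
    """Pick a binarization threshold for a continuous classifier output by
    maximising TPR - FPR over the observed scores (illustrative criterion)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best_t, best_j = None, float('-inf')
    for t in sorted(set(scores)):           # candidate cut points
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t
```

Each evolved tree gets its own threshold, so two trees with very different output ranges can still be compared fairly on TPR/FPR.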
We employed the NSGA-II algorithm (Deb et al. 2002), as updated in Fortin and
Parizeau (2013), as the selection and replacement strategy. Using a population of
200, at each generation 200 new offspring are generated, then parents and offspring
are merged into one population pool before running Pareto-based selection to select
the best 200. During evolution, we aim to minimize three fitness objectives, where
AUC is the area under the ROC curve, calculated using the Mann-Whitney statistic
(Stober and Yeh 2007), and the false positive rate (FPR) and TPR are calculated with
the output threshold set using the binarization technique mentioned above.
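The Mann-Whitney formulation of AUC amounts to the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counting one half), which can be computed directly:

```python
def auc_mann_whitney(scores, labels):
    """AUC as the normalised Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs in which the positive scores higher,
    with ties counting one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

The quadratic pair loop is fine at these dataset sizes; a rank-based O(n log n) version gives identical values.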
The chosen multi-objective fitness function is specifically tailored to suit the
mammography task. However, it would be quite straightforward to modify the
work-flow and GP system to accommodate a different set of objectives or a single
valued or composite single objective fitness function to suit a problem from an
alternative domain.
We performed stratified fivefold cross-validation (CV; Geisser 1993; Hastie et al.
2009) for all setups. However, we also retained 10 % of the data as a “holdout” (HO)
test set, where for each meta run (each consisting of 5 CV runs) this HO test set
data was separated from the CV data prior to the latter's allocation to folds for CV.
The data partitioning was carried out using the sci-kit learn ML toolkit (Pedregosa
et al. 2011). We conducted 50 cross-validated runs (each consisting of 5 runs) with
identical random seeds for each configuration outlined in Table 10.3.
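The partitioning can be sketched as follows (a plain-Python stand-in for the sci-kit learn routines actually used; the round-robin fold assignment and the function name are illustrative):

```python
import random

def stratified_folds(labels, k=5, holdout=0.1, seed=42):
    """Split instance indices into a holdout set plus k stratified CV folds:
    a per-class holdout slice is removed first, then each class's remaining
    indices are dealt round-robin across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    ho = []
    for cls in set(labels):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        n_ho = round(len(idx) * holdout)
        ho += idx[:n_ho]                       # class-proportional holdout
        for j, i in enumerate(idx[n_ho:]):
            folds[j % k].append(i)             # stratified fold assignment
    return ho, folds
```

Stratifying per class matters here: with only 78 positives against 1686 negatives, an unstratified split could easily leave a fold with almost no positives.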
10.5 Results
In this section we present our experimental results firstly with regard to AUC
measure on the training and test partitions of the CV phase. Secondly we examine
the TPR and FPR for this data. Finally we explore the results for each performance
metric, adopting various approaches to model selection, this time taking the
performance on hold-out data into consideration.
10.5.1 AUC
Figure 10.4, left plot, shows the change in average AUC over generations on the CV
training partitions averaged over all cross validated runs, whereas Fig. 10.4, right
plot shows the development of the best training AUC also averaged over all cross
validated runs. Similarly, Fig. 10.5 left plot shows the change in average AUC over
generations on the CV test partition averaged over all cross validated runs, and
Fig. 10.5 right plot shows the development of the best population test AUC also
averaged over all cross validated runs.
It appears that the best performing setups from the perspectives of both training
and test partitions are those which leverage information from both breasts, the single
breast configuration which uses all three segments, or the single breast setup which
uses features from the unsegmented image. The B2S2V1 configuration delivers
"middle of the road" AUC figures: better results than the two worst performing
setups but worse than the better ones.
Clearly some of the worst AUC results are achieved with the configurations
which use segments from a single breast, particularly that which uses two views (CC
and MLO) of the same area (top, bottom or nipple). The latter is not very surprising,
as the features contain essentially the same information, and having features which
are strongly correlated with each other is known to be detrimental to accurate
classification.
10 Image Classification with Genetic Programming 269
Fig. 10.4 (Left) Average population training AUC over five validation folds, averaged over all
cross validated runs. (Right) Best training AUC averaged over all cross validated runs. Whiskers
represent the standard error of the mean
Fig. 10.5 (Left) Average population test AUC over five validation folds, averaged over all
cross validated runs. (Right) Best test AUC averaged over all cross validated runs. Whiskers
represent the standard error of the mean
Overall, the results suggest that simply increasing the number of segments gives
a significant boost to performance in terms of training fitness but that the strategy
does not necessarily improve results on test data.
10.5.2 TP/FPRs
Population average TPR and FPRs for training data are shown in Fig. 10.6 and the
corresponding rates on test data can be seen in Fig. 10.7. The plots exemplify the
tension which exists in the population between the two competing objectives of
Fig. 10.6 (Left) Average CV training TPR averaged over all cross validated runs. (Right) Average
CV training FPR averaged over all 50 cross validated runs. Whiskers represent the standard error
of the mean
Fig. 10.7 (Left) Average CV TPR on test partitions averaged over all cross validated runs. (Right)
Average CV FPR on test partitions averaged over all 50 cross validated runs. Whiskers represent
the standard error of the mean
maximizing the TPR while simultaneously trying to minimize the FPR. In general,
a configuration which produces a higher than average TPR will also produce a
correspondingly higher FPR. For any configuration, there will always be individuals
within the population which classify all instances as either negative or positive. In
order to accurately distinguish which configurations are likely to deliver a usable
classification model it is more useful to examine the results of the best performing
individuals in the population on the various metrics: TPR, FPR and AUC. We
explore this aspect in Sect. 10.5.4.
Fig. 10.8 Average population size for each configuration. Whiskers represent the standard error
of the mean
10.5.3 Program Size
Turning our attention to the average size of the individuals produced by each
configuration as shown in Fig. 10.8, we can see that there is a substantial difference
in average size between the smallest and the largest, and the difference appears
to increase as evolution progresses. The smallest individuals are produced by
the non-segmented configurations, and the next smallest by the most feature rich
B2S6V1 setup. The largest programs result from the B1S3V1 configuration which,
interestingly, has half as many feature values for each instance as B2S6V1 does. We
can hypothesise that this may be because with fewer feature values, the system needs
to synthesize them itself, which would be fairly typical evolutionary behaviour.
10.5.4 Model Selection
As described earlier in Sect. 10.4.3 the NSGA-II multi-objective GP (MOGP)
algorithm (Deb et al. 2002; Fortin and Parizeau 2013) was used to drive selection
according to performance against our three objectives of maximizing AUC and TPR
while also minimizing FPR. When using this type of algorithm for problems where
there is a natural tension between objectives which may necessitate trade-offs, the
system typically does not return a single best individual at the end of evolution, but
rather a Pareto front or range of individuals representing various levels of trade-off
between the different objectives.
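The core relation underlying such a Pareto front is dominance. As an illustrative sketch (not the NSGA-II implementation used in the experiments), with all three objectives rewritten as minimization targets, e.g. (1 - AUC, 1 - TPR, FPR):

```python
def dominates(a, b):
    """True if objective tuple `a` Pareto-dominates `b`: no worse in
    every objective and strictly better in at least one. All entries
    are assumed to be minimized, e.g. (1 - AUC, 1 - TPR, FPR)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # Keep only the points not dominated by any other point.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

NSGA-II builds on this by ranking the population into successive non-dominated fronts and breaking ties with a crowding-distance measure.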
Due to the relationship between the main objectives for the mammography
task, we chose to use the multi-objective algorithm. Preliminary experiments with
various composite single objective fitness functions had not proved very successful,
and previous work (Ryan et al. 2014) had demonstrated the effectiveness of the
MOGP approach. However, for this particular task we are not interested in the full
Pareto front of individuals; after all, a model with a zero FPR and a zero or very
low TPR (every instance classified as negative) is not much use in this context. What we
really care about is achieving the lowest possible FPR for the highest possible TPR.
Thus, during evolution we maintained a single entry "hall of fame" (HOF) for each
CV iteration, whereby, as we evaluated each new individual on the training data, if it
had a higher TPR, or if it had an equal TPR but a lower FPR than the HOF incumbent
for that CV iteration, the new individual replaced that HOF incumbent.
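The replacement rule just described can be sketched as follows (names are illustrative):

```python
def update_hof(hof, candidate):
    """Single-entry hall of fame. Both `hof` and `candidate` are
    (tpr, fpr, model) tuples; `hof` may be None initially. The
    candidate replaces the incumbent if it has a higher TPR, or an
    equal TPR with a lower FPR."""
    if hof is None:
        return candidate
    tpr, fpr, _ = candidate
    best_tpr, best_fpr, _ = hof
    if tpr > best_tpr or (tpr == best_tpr and fpr < best_fpr):
        return candidate
    return hof
```

Note the lexicographic ordering: TPR is compared first, so a candidate can never displace the incumbent on FPR alone.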
We report results on the training and test CV segments, but the most important
results are those for the HO test set, as these provide an indication of how the
system might be expected to perform on new, unseen instances. We present
results, with best results in bold text, under several different model selection
criteria:
- Mean average best trained individual: results for each HOF are firstly averaged
for each CV run and then averaged across the 50 cross validated runs. See
Table 10.4.
- Average best trained individual: the best trained individual is chosen from each
CV run and the results for these 50 best individuals are averaged. See Table 10.5.
- Average best test individual: the best performing individual on the CV test partition is
chosen from amongst the 5 HOF members for each CV run and the results of these
50 individuals are averaged. See Table 10.6.
- Single best trained individual: the single individual with the best training results
across all runs. See Table 10.7.
- Single best test individual: the individual with the best CV test results, chosen
from amongst the 250 best trained solutions. See Table 10.8.
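The first two averaging schemes differ only in whether per-run HOF scores are averaged or maximized before the cross-run average. A small sketch with hypothetical helper names, operating on per-fold HOF scores:

```python
import statistics

def mean_average_best(hof_scores_per_run):
    # hof_scores_per_run: one list per cross-validated run, holding the
    # per-fold HOF scores. Average within each run first, then average
    # across runs ("mean average best trained individual").
    return statistics.mean(statistics.mean(run) for run in hof_scores_per_run)

def average_best(hof_scores_per_run):
    # Take the best fold score from each run, then average across runs
    # ("average best trained individual").
    return statistics.mean(max(run) for run in hof_scores_per_run)
```

The second scheme is necessarily at least as optimistic as the first, since it averages per-run maxima rather than per-run means.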
The results on training data shown in Tables 10.4, 10.5, 10.6, 10.7 and 10.8 show
that simply adding features gives a boost to performance. The configuration with
the greatest number of features (B2S6V1) consistently produces the lowest FPR
and the best AUC score on the training data. However, this setup appears to suffer
from over-fitting, as the excellent training results do not translate into good test
results, as evidenced by the low TPR on the hold out test data. This configuration
has the largest number of segments and, as each added segment contributes two
extra features, it also has more features than the others.
Regardless of which model selection approach we choose to adopt for evaluating
performance on the hold out test data, the best evolved model is produced by the two
breast non-segmented configuration (B2S0V1), which has a best result TPR of 1 with
an FPR as low as 0.19 (Table 10.7), while the more heavily segmented
setups perform worst overall.
Table 10.4 Mean average training, test and hold out TPR, FPR and AUC of best trained individuals

Method     Train TPR  Train FPR  Train AUC  Test TPR  Test FPR  Test AUC  HO TPR  HO FPR  HO AUC
B1S0V1     1          0.60       0.78       0.92      0.63      0.73      1       0.66    0.76
B1S1V1     1          0.62       0.74       0.94      0.65      0.69      1       0.65    0.80
B1S2V2     1          0.72       0.71       0.97      0.74      0.68      0.97    0.72    0.70
B1S3V1     1          0.48       0.82       0.93      0.50      0.77      0.96    0.51    0.76
B2S0V1     1          0.49       0.81       0.92      0.54      0.75      1       0.51    0.83
B2S2V1     1          0.55       0.77       0.96      0.58      0.74      1       0.57    0.82
B2S3+0V1   1          0.57       0.76       0.93      0.59      0.73      0.96    0.55    0.77
B2S4+0V1   1          0.54       0.76       0.92      0.57      0.71      0.96    0.52    0.78
B2S4V1     1          0.48       0.82       0.92      0.52      0.76      0.97    0.52    0.78
B2S6V1     1          0.40       0.84       0.92      0.45      0.77      0.86    0.46    0.73
Table 10.5 Average best training, test and hold out TPR, FPR and AUC according to best trained individuals for each run

Method     Train TPR  Train FPR  Train AUC  Test TPR  Test FPR  Test AUC  HO TPR  HO FPR  HO AUC
B1S0V1     1          0.47       0.83       0.86      0.51      0.75      0.99    0.51    0.79
B1S1V1     1          0.53       0.77       0.90      0.56      0.69      1       0.54    0.85
B1S2V2     1          0.67       0.74       0.95      0.68      0.72      0.96    0.66    0.74
B1S3V1     1          0.37       0.86       0.90      0.39      0.81      0.95    0.41    0.80
B2S0V1     1          0.39       0.84       0.85      0.41      0.77      1       0.38    0.86
B2S2V1     1          0.48       0.79       0.93      0.51      0.75      0.99    0.48    0.86
B2S3+0V1   1          0.47       0.80       0.90      0.51      0.74      0.98    0.41    0.80
B2S4+0V1   1          0.45       0.80       0.88      0.48      0.72      0.96    0.43    0.82
B2S4V1     1          0.37       0.85       0.88      0.39      0.79      0.96    0.39    0.83
B2S6V1     1          0.32       0.87       0.86      0.38      0.76      0.82    0.36    0.76
Table 10.6 Average training, test and hold out TPR, FPR and AUC according to best test individuals for each run

Method     Train TPR  Train FPR  Train AUC  Test TPR  Test FPR  Test AUC  HO TPR  HO FPR  HO AUC
B1S0V1     1          0.61       0.77       0.98      0.64      0.76      1       0.67    0.76
B1S1V1     1          0.62       0.73       1         0.65      0.71      1       0.66    0.79
B1S2V2     1          0.72       0.73       1         0.73      0.71      0.96    0.72    0.71
B1S3V1     1          0.49       0.82       1         0.48      0.84      0.96    0.50    0.77
B2S0V1     1          0.51       0.80       1         0.55      0.78      1       0.51    0.82
B2S2V1     1          0.55       0.78       1         0.57      0.77      1       0.55    0.84
B2S3+0V1   1          0.57       0.76       0.99      0.57      0.76      0.97    0.55    0.79
B2S4+0V1   1          0.56       0.76       0.99      0.56      0.77      0.98    0.52    0.80
B2S4V1     1          0.48       0.81       0.99      0.51      0.78      0.96    0.51    0.78
B2S6V1     1          0.42       0.83       1         0.44      0.82      0.89    0.45    0.75
Table 10.7 Training, test and hold out TPR, FPR and AUC of single best trained individual

Method     Train TPR  Train FPR  Train AUC  Test TPR  Test FPR  Test AUC  HO TPR  HO FPR  HO AUC  Size
B1S0V1     1          0.30       0.84       0.80      0.38      0.77      0.75    0.29    0.80    687
B1S1V1     1          0.43       0.79       0.86      0.47      0.66      1       0.41    0.88    473
B1S2V2     1          0.61       0.78       0.89      0.62      0.73      1       0.61    0.77    537
B1S3V1     1          0.26       0.90       0.80      0.26      0.82      1       0.28    0.88    271
B2S0V1     1          0.31       0.89       0.79      0.27      0.87      1       0.19    0.91    979
B2S2V1     1          0.36       0.88       0.94      0.49      0.79      1       0.48    0.88    441
B2S3+0V1   1          0.37       0.85       0.92      0.63      0.70      1       0.58    0.78    211
B2S4+0V1   1          0.35       0.81       0.85      0.35      0.80      1       0.34    0.84    805
B2S4V1     1          0.27       0.88       0.80      0.29      0.82      1       0.30    0.89    341
B2S6V1     1          0.20       0.93       0.69      0.38      0.68      0.92    0.38    0.83    613
Table 10.8 Training, test and hold out TPR, FPR and AUC of single best test individual

Method     Train TPR  Train FPR  Train AUC  Test TPR  Test FPR  Test AUC  HO TPR  HO FPR  HO AUC  Size
B1S0V1     1          0.56       0.78       1         0.37      0.88      1       0.44    0.88    193
B1S1V1     1          0.59       0.77       1         0.50      0.82      1       0.58    0.87    391
B1S2V2     1          0.67       0.73       1         0.66      0.76      0.95    0.66    0.74    401
B1S3V1     1          0.35       0.81       1         0.34      0.89      0.80    0.37    0.73    21
B2S0V1     1          0.43       0.87       1         0.38      0.83      1       0.37    0.83    341
B2S2V1     1          0.48       0.79       1         0.50      0.81      1       0.46    0.87    171
B2S3+0V1   1          0.44       0.75       1         0.39      0.81      1       0.39    0.82    509
B2S4+0V1   1          0.40       0.82       1         0.37      0.84      1       0.38    0.86    599
B2S4V1     1          0.36       0.85       1         0.36      0.85      1       0.38    0.79    419
B2S6V1     1          0.30       0.87       1         0.29      0.92      0.77    0.31    0.79    349
Both B2S3+0V1 and B2S4+0V1 are two breast configurations which combine
features of segmented and unsegmented images. They are essentially combinations
of B2S2V1 and B2S0V1—both of which deliver good results on the hold out test
data. The augmented methods do not appear to contribute a huge improvement,
although we see that for several views of the data, they produce a very competitive
low FPR.
When we compare the figures for average program size between the single best
individuals selected based on training or CV test partitions it is interesting to note
that those selected on test performance are almost universally smaller than the ones
selected based on training results. However, the best overall individual is the largest,
at 979 nodes.
To compare with results from the literature we convert the FPRs into FPPI values,
which we report in Table 10.9. Here, the average results reported derive from the
data in Table 10.4, which represents the mean average results for all of the best
trained individuals. The best results use the TP and FP data of the single best
individuals selected based on performance on CV test partitions. Results refer to
performance on the crucial hold out test data.
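The FPR-to-FPPI conversion is not spelled out at this point in the text. A common approximation, assuming each image contributes a fixed number of negative segments, is sketched below; the function name and the per-image segment count are illustrative only, and the chapter's exact conversion may differ:

```python
def fpr_to_fppi(fpr, neg_segments_per_image):
    """Approximate false positives per image (FPPI) from a per-segment
    false positive rate, assuming `neg_segments_per_image` negative
    segments per image. Illustrative sketch only: the exact conversion
    depends on how each configuration segments the images."""
    return fpr * neg_segments_per_image
```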
Clearly the best results are produced by the two breast non-segmented approach
B2S0V1, with a TPR of 1 and an FPPI of 0.33. This is closely followed by its single
breast counterpart B1S0V1, which again delivered a perfect TPR and an FPPI of
0.41 (Table 10.9). Of the segmented setups, the two augmented configurations B2S3+0V1 and
B2S4+0V1 also produced good results, with perfect TPRs combined with good
FPPIs of 1.11 and 1.08 respectively. Also, the B2S4V1 method did very well, with
a TPR of 1 and an FPPI of 1.11. Contrast these figures with the results reported in
Sect. 10.2.5, with scores of 97 % TP and FPPIs of 4–15.
Overall, several of our configurations proved capable of correctly classifying
100 % of the cancerous cases while at the same time having a low FPPI, and the
best results were delivered by individuals trained to view breast asymmetry.
Table 10.9 Mean average TPR and FPPI of best trained individual, and TPR and FPPI of single best trained individual, both on HO data

Method     Avg TPR  Avg FPPI  Best TPR  Best FPPI
B1S0V1     1        0.61      1         0.41
B1S1V1     1        1.88      1         1.68
B1S2V2     0.97     2.03      0.95      1.86
B1S3V1     0.96     1.49      0.80      1.08
B2S0V1     1        0.45      1         0.33
B2S2V1     1        1.67      1         1.34
B2S3+0V1   0.96     1.57      1         1.11
B2S4+0V1   0.96     1.48      1         1.08
B2S4V1     0.97     1.52      1         1.11
B2S6V1     0.86     1.28      0.77      1.06
10.6 Conclusions and Future Work
We have presented an entire work-flow for automated mammogram analysis with
GP as its cornerstone. Our system operates with raw images, extracts the features
and presents them to GP, which then evolves classifiers. The result is a Stage 1
cancer detector that achieves 100 % sensitivity on unseen test data from the USF
mammogram library, with a lowest reported FPPI of 0.33.
This work-flow can be applied to virtually any image classification task with
GP, simply by using task-specific features. In this chapter we use textural features
which can be directly employed by any problem with similar images, which means
that the only modification required to the system is the manner in which the images
are segmented.
The experimental set up that had the lowest FPPI was the one that compared both
entire breasts, showing that we successfully leveraged textural breast asymmetry as
a potential indicator for cancerous growths. Additionally, several of the segmented
configurations also produced very good results, indicating that the system is
capable of not only identifying with high accuracy which breasts are likely to have
suspicious lesions but also which segments contain suspicious areas. The first of
these capabilities could prove useful in providing second reader functionality to
busy radiologists, whereas the second could supply inputs to an automated diagnostic
system where further analysis can be undertaken.
One minor limitation of this work is that all of the positive cases examined
came from the same volume. However, it is reasonable to assume that for any
automated system, a classifier will be generated for a specific type of X-ray machine
used by a