An Integrated Approach to Stage 1 Breast Cancer
Jeannie M. Fitzgerald
University of Limerick, Ireland
University of Limerick, Ireland
University of Limerick, Ireland
Poznan University of
We present an automated, end-to-end approach for Stage 1 breast cancer detection. The first phase of our proposed workflow takes individual digital mammograms as input and outputs several smaller sub-images from which the background has been removed. Next, we extract a set of features which capture textural information from the segmented images. In the final phase, the most salient of these features are fed into a Multi-Objective Genetic Programming system which then evolves classifiers capable of identifying those segments which may have suspicious areas that require further investigation.
A key aspect of this work is the examination of several new experimental configurations which focus on textural asymmetry between breasts. The best evolved classifier using such a configuration can deliver results of 100% accuracy on true positives and a false positive per image rating of just 0.33, which is better than the current state of the art.
Categories and Subject Descriptors
I.2.2 [Artificial Intelligence]: Automatic Programming
Keywords
Mammography; Classification; Multi-Objective Genetic Programming
1. INTRODUCTION
Routine mammographic screening, particularly at a national level, is by far the most effective tool for the early detection and subsequent successful treatment of breast cancer [30, 32]. It is essential to discover signs of cancer early, as survival is directly correlated with early detection.
GECCO ’15, July 11 - 15, 2015, Madrid, Spain
2015 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-3472-3/15/07. . . $15.00
The introduction of breast screening programs has contributed to a significantly higher demand for radiologists, and a worldwide shortage of qualified radiologists who specialize in mammography has led to many radiologists being dangerously overworked. This is likely to lead to (i) there being insufficient time for radiologists to interpret mammograms (which are notoriously difficult to read); (ii) an inability to provide redundant readings (second reader); and (iii) radiologists being overly conservative, which in turn is likely to increase the number of patient call backs, potentially resulting in unnecessary biopsies which may lead to patient anxiety and mistrust of the system.
Fortunately, the increased availability of digital mammog-
raphy means that it is now much more feasible to use au-
tomated methods to assist with detection. A mammogram
is performed by compressing the breast between two plates
which are attached to a mammogram machine: an adjustable
plate on top with a ﬁxed x-ray plate underneath. An image
is recorded using a digital detector located on the bottom
plate. Two views of each breast are recorded: the cranio-
caudal (CC) view, which is a top down view, and the medi-
olateral oblique (MLO) view, which is a side view taken at
an angle. Functional breast tissue is termed parenchyma
and this appears as white areas on a mammogram, while
the black areas are composed of adipose (non-functioning,
fatty) tissue which is transparent under X-rays.
Various levels of automation exist in mammography, and these can generally be divided into Computer-Aided Detection (CAD) and Computer-Aided Diagnosis (CADx). In this work we concentrate exclusively on CAD, and in particular on what is known as Stage 1 detection.
A stage 1 detector examines mammograms and highlights
suspicious areas that require further investigation. For this
task, it is important to strike a balance: an overly conservative approach degenerates to marking every mammogram (or segment thereof) as suspicious, while missing a cancerous area can have disastrous consequences. Our objective is to develop a stage 1 detector that is highly accurate in terms of
detecting suspicious areas (True Positives (TP)) with as few
false alarms (False Positives (FP)) as possible. In the lit-
erature, it is relatively standard practice to convert FP to
False Positives Per Image (FPPI).
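As a minimal sketch of the two headline metrics, the snippet below uses the conventional definitions: TP rate as detected positives over all positives, and FPPI as the total number of false alarms divided by the number of images examined. The function names and the example counts are hypothetical and are not taken from our experiments.

```python
def tp_rate(true_positives: int, positives: int) -> float:
    """Fraction of truly suspicious items that were flagged."""
    if positives <= 0:
        raise ValueError("positives must be positive")
    return true_positives / positives


def fppi(false_positives: int, n_images: int) -> float:
    """False Positives Per Image: total false alarms / images examined."""
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    return false_positives / n_images


# Hypothetical counts, for illustration only.
example_tpr = tp_rate(true_positives=9, positives=10)
example_fppi = fppi(false_positives=30, n_images=90)
```

Because FPPI is normalised by images rather than by negative instances, it is not a rate bounded by 1 when an image can raise more than one alarm.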
The potential for CAD to improve screening mammography outcomes by increasing the cancer detection rate has been shown in several retrospective studies including that of Cupples et al., who reported an overall increase of 16% in the cancer detection rates using CAD together with traditional detection methods. In that study, CAD increased the detection rate of small invasive cancers by 164%.
The remainder of this paper is organised as follows: In
Section 2 we outline related work, then in Section 3 we pro-
vide a detailed description of our proposed workﬂow, next
in Section 4 we further describe our experiments and report
results and ﬁnally, in Section 5 we state our conclusions and
propose avenues for future research.
2. RELATED WORK
Previous GP research eﬀort such as that of Nandi et al. [14,
24] has successfully tackled both feature selection and the
classiﬁcation of previously identiﬁed abnormalities as either
benign or malignant.
Other notable research using GP is that of Ahmad et al., who designed a Stage 2 cancer detector for the well known Wisconsin Breast Cancer dataset, in which they used the features extracted from a series of fine needle aspirations and an evolved neural network. Ludwig and Roos used GP to estimate the prognosis of breast cancer patients from the same data set, using GP to reduce the number of features before evolving predictors. Langdon and Harrison took a different approach, using biopsy gene chip data, but their system approached a similar level of automation.
More recent GP effort in mammography has been concerned with a combination of feature detection and classification. One such study by Ryan et al. reports a best TP rate of 100% with an FPPI of just 1.5. In other work, the best reported result seems to be that of Li et al., which reports a 97.3% TP rate and 14.81 FPPI. Similar work by Polakowski had a lower TP rate (92%) but with a much lower FPPI (8.39).
The standard method of reporting results is to use a TP/FPPI breakdown, which is what we will present here. However, it is important to note that while TP rates may be directly comparable if the same volumes are used, FPPI may not be, as the metric depends to some extent on the number of mammographic regions examined and the detection objective of the system. See Section 3.1 for further explanation.
Regarding the various types of features that have been studied, those which capture aspects of shape, edge-sharpness and texture have been used for mass segmentation and detection. Nandi et al. reported a classification accuracy of 98% using a combination of all three of these feature types.
The majority of existing GP systems operate only at the classification stage and rely upon previously extracted features, and with the exception of Ryan et al., all of the previous work mentioned above deals with a single breast in isolation. In this paper, we are influenced by the research of Tabar, which indicates that, in general, both breasts from the same patient have the same textural characteristics. Our hypothesis is that the existence of suspicious areas may be more likely if a patient’s breasts are texturally different from each other.
Another strong argument in favour of considering textural aspects is the connection between breast density and breast cancer. Density of breast tissue is an important attribute of the parenchyma and it has been established that mammograms of dense breasts are more challenging for human experts. At the same time, repeated studies have demonstrated that women with dense tissue in greater than 75% of the breast are 4 to 6 times more likely to develop breast cancer compared with women with little to no breast density.
A complete review of the important relevant research is
beyond the scope of this paper. The interested reader is
directed to  for an evaluation of the recent state-of-the-
art in computer-aided detection and diagnosis systems.
Given the importance of parenchymal density as a risk fac-
tor and the diﬃculty for human experts in identifying suspi-
cious areas in this challenging environment, we believe that
a stage 1 detector which focuses on textural asymmetry may
have a strong decision support role to play in the identification of suspicious mammograms. GP is particularly suitable for this type of task as it is very flexible, is capable of automated feature selection and extraction, and, most importantly in the medical domain, provides an explanation
for classiﬁcation decisions in the form of a human readable
With this in mind, we develop an automated stage 1 detection system with GP at its centre, where parenchymal texture and textural asymmetry between both breasts of the same patient determine the choice of image features (feature detection) and the construction of input datasets which
leverage this data. In doing so, we extend the work of  by
considering a diﬀerent set of features and a greater number
Our work-ﬂow begins with background suppression and
image segmentation and then progresses to the generation of
textural features. These are fed through the system, which
performs feature reduction before passing the most salient
ones to GP to evolve classiﬁers. The best resulting classiﬁers
deliver results of 100% accuracy on true positives and a false
positive per image rating of just 0.33, which is appreciably
better than prior work.
Figure 1: Image Segmentation
3.1 Background Suppression and Image Segmentation
The background of the mammographic image is never per-
fectly homogeneous, and it includes at least one tag letter
indicating whether the image is either a right or left breast.
This is often augmented by a string of characters indicating
which view (CC or MLO) was taken. These backgrounds
need to be replaced with homogeneous ones to correctly pro-
cess the image in a later stage. We used the same approach
as in  to suppress the image background.
We chose to divide each image into three segments, and
to examine each segment separately. As there can be more
than one suspicious area in an image, we return true for as
many segments as the system ﬁnds suspicious, meaning that
an image can have several positives returned. With Stage 1
detectors such as ours, this is described by the FPPI of an
image. As the maximum FPPI is capped at the number of
segments that the breast is divided into, using fewer seg-
ments means that the FPPI will be lower. On the other
hand, accurate detection of the TPs is substantially more
diﬃcult as the area is larger.
Using the same algorithm as described in , we divided
the breast images into three overlapping segments of roughly similar size as shown in Figure 1: one segment captures the nipple area, and one each captures the top and the bottom of the remainder of the breast. The overlapping is designed to help reduce the possibility of a mass going undetected.
3.2 Feature Detection
Before attempting to classify mammograms as suspicious or
not we must ﬁrst extract features that GP can use to dis-
tinguish between classes. In this study, we have chosen to
use Haralick’s Texture Features . Textural features are
appropriate in this case because we are examining parenchy-
mal patterns, and our hypothesis is that suspicious areas are
likely to be texturally dissimilar to normal areas.
The seminal work of Haralick et al. described a method of generating 14 measures which can be used to form textural features from a set of co-occurrence matrices or “grey tone spatial dependency matrices”. When applied to pixel grey levels, the Grey Level Co-occurrence Matrix (GLCM) is defined to be the distribution of co-occurring values at a given offset. Using the co-occurrence matrix, different properties of the pixel distribution can be generated by applying various calculations to the matrix values.
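To make the definition concrete, the following is a minimal pure-Python sketch of a GLCM for a single offset, together with one derived property (contrast). It counts ordered pairs only, whereas many implementations also symmetrise the matrix; the function names and the toy image are illustrative assumptions rather than the Matlab routines used in this work.

```python
def glcm(image, offset, levels):
    """Count ordered grey-level pairs (i, j) separated by offset = (d_row, d_col)."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    return counts


def normalise(counts):
    """Turn raw pair counts into a joint probability distribution."""
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Haralick contrast: sum over (i, j) of p(i, j) * (i - j)^2."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# Toy 3x3 image with 3 grey levels; offset (0, 1) pairs each pixel with its right neighbour.
toy = [[0, 0, 1],
       [1, 2, 2],
       [0, 1, 2]]
toy_glcm = glcm(toy, (0, 1), levels=3)
toy_contrast = contrast(normalise(toy_glcm))
```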
To quantitatively describe the textural characteristics of breast tissue, we calculate a GLCM for each segment and for each breast. To keep the GLCM size manageable, we first reduce the number of grey levels to 256 (from 65535 in the original images). We independently calculate GLCMs for four orientations corresponding to two adjacent and two diagonal neighbours. Next, we calculate the Haralick features, which reflect (among others) contrast, entropy, variance, and correlation of pixel values. In this work we examine a neighbourhood of 1 and average the feature values for the four orientations. The down-sampling of grey levels, construction of GLCMs and extraction of Haralick features is achieved using Matlab.
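The pipeline described above can be sketched as follows. This is an illustration in Python rather than the Matlab code used in this work, and it assumes 16-bit input pixels, integer division for the down-sampling, and the offset set shown. Contrast is computed directly as the mean squared grey-level difference over neighbouring pairs, which equals the value obtained from the normalised GLCM.

```python
# Two adjacent and two diagonal neighbours at distance 1 (an assumed choice of directions).
OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def quantise(image, in_levels=65536, out_levels=256):
    """Reduce grey levels by integer division (assumes pixel values in [0, in_levels - 1])."""
    factor = in_levels // out_levels
    return [[min(v // factor, out_levels - 1) for v in row] for row in image]


def contrast_at(image, offset):
    """Contrast for one orientation: mean of (i - j)^2 over all pixel pairs at that offset."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    total, pairs = 0, 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                total += (image[r][c] - image[r2][c2]) ** 2
                pairs += 1
    return total / pairs if pairs else 0.0


def mean_contrast(image):
    """Average the per-orientation contrast over the four orientations."""
    return sum(contrast_at(image, o) for o in OFFSETS) / len(OFFSETS)


raw = [[0, 256], [512, 65535]]   # hypothetical 16-bit pixel values
small = quantise(raw)            # values now lie in 0..255
checker = mean_contrast([[0, 1], [1, 0]])
```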
Segments are rectangular and often extend beyond the breast, thus containing some background information. A GLCM calculated from such a segment in a conventional way would register very high values for black pixels and so distort the values of Haralick features. As many mammographic images contain sections of adipose tissue, which appears black in mammograms and which is useful information in its own right, we should not ignore black pixels. Therefore, before calculating the GLCM, we increase by one the intensity of every pixel within the breast, using the information resulting from the segmentation stage. The pixels that already have the maximal value retain it (this causes a certain, albeit negligible, information loss, as there are typically very few such pixels). Then, once the GLCM has been calculated, we simply “hoist” the GLCM up and to the left to remove the impact of the unmodified background pixels.
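One way to read the shift-and-hoist step is sketched below. It assumes that the suppressed background has grey level 0 and that the segmentation stage supplies a boolean breast mask; both the function name and the mask representation are hypothetical.

```python
def glcm_inside_breast(image, mask, offset=(0, 1), levels=256):
    """GLCM restricted to breast pixels using the shift-and-hoist idea.

    image: 2-D list of grey levels in [0, levels - 1]
    mask:  2-D list of booleans, True where the pixel lies inside the breast
    """
    top = levels - 1
    # Breast pixels move up one level (saturating at the top); background is forced to 0.
    shifted = [[min(v + 1, top) if inside else 0
                for v, inside in zip(row, mask_row)]
               for row, mask_row in zip(image, mask)]
    d_row, d_col = offset
    rows, cols = len(shifted), len(shifted[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[shifted[r][c]][shifted[r2][c2]] += 1
    # "Hoist": drop row 0 and column 0, which hold every pair involving a background pixel.
    # Index k of the result then corresponds to original grey level k inside the breast
    # (the two highest original levels share the last index because of saturation).
    return [row[1:] for row in counts[1:]]


# Toy segment: the left two columns are black breast tissue, the right column is background.
toy_image = [[0, 0, 5], [0, 0, 7]]
toy_mask = [[True, True, False], [True, True, False]]
hoisted = glcm_inside_breast(toy_image, toy_mask, levels=8)
```

In the toy example the black tissue pixels are retained as level 0 of the hoisted matrix, while the pairs that touch the background disappear.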
We conducted a preliminary analysis of the 13 computed Haralick features, examining variance across and between both classes, and then carried out a more formal analysis using several ranker methods which ranked the attributes according to the concept of information gain. In this context information gain can be thought of as a measure of the value of an attribute which describes how well that attribute separates the training examples according to their target class labels. These feature selection steps suggested that the most promising features in terms of discrimination were contrast and difference entropy. Accordingly, we discarded the other features and let GP focus on those two.
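For readers unfamiliar with the measure, the following sketch computes information gain for a numeric attribute split at a single threshold. The threshold-based discretisation is an assumption made for illustration; it is not the ranker implementation used in our analysis, and the example values are invented.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting the examples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


classes = [0, 0, 1, 1]
gain_useful = information_gain([0.1, 0.2, 0.8, 0.9], classes, threshold=0.5)
gain_useless = information_gain([0.1, 0.9, 0.1, 0.9], classes, threshold=0.5)
```

An attribute that separates the classes perfectly yields a gain equal to the class entropy (1 bit here), while an attribute unrelated to the class yields a gain of zero.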
3.3 Dataset Construction
We use the University of South Florida Digital Database for Screening Mammography (DDSM), which is a collection of 43 “volumes” of mammogram cases. A volume is a collection of mammogram cases and can be classified as either normal, positive, benign or benign without callback. All patients in a particular volume have the same classification. We use cases from the cancer 02 and three of the normal volumes (volumes 1 to 3).
The incidence of positives within mammograms is roughly
5 in 1000 giving a massively imbalanced data set. To ensure
that our training data maintains a realistic balance, we de-
liberately select only a single volume of positive cases. In
constructing training and test data, several images were discarded either because of image processing errors or because we were unable to confidently identify which segment/s were cancerous for a particular positive case. This initial processing resulted in a total of 294 usable cases, 75 of which contain cancerous growths (which we call positive from now on).
Each case initially consists of images for the left and right
breasts and for the MLO and CC views of each breast. Once
the segmentation step has been completed images are added
for each of the three segments (nipple/top/bottom) for each
view of each breast. Thus, there are a total of four images
for each breast, for each view: one for the entire breast (A),
and one for each of the three segments (At, Ab, An).
If we count the numbers of positives (P) and negatives (N) in terms of breasts rather than cases, which is reasonable given that each breast is examined independently (most patients with cancerous growths do not have them in both breasts), then the number of non-cancerous images increases significantly, giving two for each non-cancerous case and one for most cancerous cases. For the volumes studied, of the 75 usable positive cases, 3 have cancer in both breasts. Thus, considering full breast CC images only, we have 78 positive images and 510 (219 * 2 + 72) negative images.
Turning our attention to segments (At, Ab, An) (excluding full breast images), and again considering only CC segments for the moment, we have 3 segments for each non-cancerous breast (both breasts of each non-cancerous case and the healthy breast of each unilateral positive case), together with 2 non-cancerous segments for each cancerous breast, which gives a total of 1686 (510 * 3 + 78 * 2) non-cancerous segments and 78 cancerous segments. Similarly, for the MLO view there are 1686 non-cancerous segments and 78 cancerous ones.
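The counts above can be reproduced with a few lines of arithmetic. The sketch assumes, as the figures in the text imply, exactly one cancerous segment per cancerous breast; the variable names are ours.

```python
usable_cases = 294
positive_cases = 75
bilateral_cases = 3        # positive cases with cancer in both breasts
segments_per_breast = 3    # At, Ab, An

negative_cases = usable_cases - positive_cases                        # 219
positive_breasts = positive_cases + bilateral_cases                   # 78
healthy_breasts_in_positive_cases = positive_cases - bilateral_cases  # 72
negative_breasts = 2 * negative_cases + healthy_breasts_in_positive_cases

# Assumption: each cancerous breast has exactly one cancerous segment.
positive_segments = positive_breasts
negative_segments = (negative_breasts * segments_per_breast
                     + positive_breasts * (segments_per_breast - 1))

# Combining the CC and MLO views doubles both counts.
combined = (2 * positive_segments, 2 * negative_segments)
```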
Thus, we obtain three distributions: one for the non-segmented single view (CC or MLO) full breast images (78 positives (Ps) and 510 negatives (Ns)); one for the segmented single views (78 Ps and 1686 Ns); and one for the segmented combined CC and MLO views (156 Ps and 3372 Ns). Each of these three distributions exhibits very significant class imbalance which, in and of itself, increases the level of difficulty of the classification problem. The imbalance in the data was mitigated in all cases by using Proportional Individualised Random Sampling, as described in .

Name      Ps   Ns    Breasts  Segs  Views     Description
B1S0V1    78   510   1        1     CC        Unsegmented (full breast image).
B1S1V1    78   1686  1        1     CC        Single segment.
B1S2V2    156  3372  1        2     CC + MLO  1 segment for each view.
B1S3V1    78   1686  1        3     CC        3 segments (At, Ab, An).
B2S0V1    78   510   2        2     CC        Unsegmented images from both breasts.
B2S2V1    78   1686  2        2     CC        1 segment from each breast.
B2S4+0V1  78   1686  2        4     CC        1 segment + unsegmented from each.
B2S3+0V1  78   1686  2        3     CC        1 segment from each + unsegmented from first.
B2S4V1    78   1686  2        4     CC        3 segments + one segment.
B2S6V1    78   1686  2        6     CC        3 segments + one segment + 2 unsegmented.
Table 1: Experimental Configurations. Each was generated from the same master data set.
Based on this master dataset, we consider several setups representing different configurations of breasts, segments and views (see Table 1). The following terminology is used to describe the composition of instances for a given setup: BXSYVZ, where X is the number of breasts, Y the number of segments and Z the number of views for a given instance. In the cases where there is just one view (B1S1V1, B2S2V1, B1S3V1, B2S4V1) we use the CC views, while in the cases where the breast has been segmented, the system attempts to classify whether or not the segment has a suspicious area. Of particular interest are the two-breast (B2SYV1) setups, which investigate the use of asymmetry. These rely solely on the CC view: each instance is comprised of selected features from one breast CC segment/s together with the same features taken from the corresponding other breast CC segment/s for the same patient.
There are two setups which deviate slightly from the naming scheme above: B2S3+0V1 and B2S4+0V1. Here, +0 indicates that features for a non-segmented image have been added.
We want to exploit any differences between a segment and the rest of the breast (i.e. between A and Ax) but also between a segment and the corresponding segment from the opposite breast (say B and Bx), with the objective of evolving a classifier capable of pinpointing a specific cancerous segment. To facilitate this process, where we use more than one segment for a particular setup, features from the segment of interest are the first appearing data items in each instance for the dataset for that setup. Details of the specific setups used in the current study are as follows:
This dataset configuration (B1S0V1) has an instance for the selected features of each full breast image. It uses the CC view only and has a separate instance for each breast for each patient.
The B1S1V1 configuration also uses only the CC view, but this setup uses each of the three segments (At, Ab, An) separately, i.e. each instance is comprised of the feature values for a single segment. Again there is an instance for each breast for each segment.
Both views are used in the B1S2V2 setup. For each segment, excluding the full breast image, for each breast, each instance contains feature values for that segment and the corresponding segment for the other view (CC or MLO). So for a given segment, say At, there are instances for the following:
AtLEFT CC, AtLEFT MLO
AtLEFT MLO, AtLEFT CC
AtRIGHT CC, AtRIGHT MLO
AtRIGHT MLO, AtRIGHT CC
In this setup the segments of interest are AtLEFT CC, AtLEFT MLO, AtRIGHT CC and AtRIGHT MLO respectively, i.e. the segment whose features occur first. This principle applies to all of the remaining setups where more than one segment is used.
This configuration (B1S3V1) uses three CC segments (At, Ab, An) for a single breast, where the first segment is alternated in successive instances. For example, for a given single breast there are three training instances similar to:
AtLEFT CC, AbLEFT CC, AnLEFT CC
AbLEFT CC, AnLEFT CC, AtLEFT CC
AnLEFT CC, AtLEFT CC, AbLEFT CC
where the order of the remaining two segments does not matter.
In the B2S0V1 setup we study a simple case of symmetry where each instance comprises features for two images: one for each full breast image, left and right, CC view only:
ALEFT CC, ARIGHT CC
ARIGHT CC, ALEFT CC
In this configuration (B2S2V1) we investigate another case of symmetry: each entry consists of the feature values for a single CC segment from one breast combined with those of the corresponding CC segment from the other breast, for the same patient. In this case there are two entries for each segment: (AxLEFT CC, AxRIGHT CC) and (AxRIGHT CC, AxLEFT CC), where x represents a particular segment (At, Ab, An).
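A minimal sketch of how such paired instances could be assembled is given below. The dictionary layout, the segment keys and the feature values are hypothetical; each feature list stands for the two selected Haralick features of one CC segment.

```python
SEGMENTS = ("t", "b", "n")  # top, bottom, nipple


def paired_instances(left, right):
    """Build two instances per segment: (left, right) and (right, left).

    left, right: dicts mapping a segment key to that segment's feature list.
    The features of the segment of interest always come first.
    """
    instances = []
    for seg in SEGMENTS:
        instances.append(left[seg] + right[seg])
        instances.append(right[seg] + left[seg])
    return instances


# Invented feature values (contrast, difference entropy) for one patient.
left_breast = {"t": [1.0, 2.0], "b": [3.0, 4.0], "n": [5.0, 6.0]}
right_breast = {"t": [7.0, 8.0], "b": [9.0, 10.0], "n": [11.0, 12.0]}
rows = paired_instances(left_breast, right_breast)
```

Presenting each pair in both orders gives every segment one instance in which it is the segment of interest.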
Each instance in this setup (B2S3+0V1) is comprised of feature data from segmented and unsegmented images. It consists of information for a segment, the unsegmented image and the corresponding segment from the other breast. For example:
AtLEFT CC, ALEFT CC, BtLEFT CC
Similar to B2S3+0V1, each instance in this setup (B2S4+0V1) is again comprised of feature data from segmented and unsegmented images. It consists of information for a segment, the unsegmented image and the corresponding segment from the other breast together with the unsegmented image data for the other breast. For example:
AtLEFT CC, ALEFT CC, BtLEFT CC, BLEFT CC
The B2S4V1 experimental setup is a combination of B1S3V1 and B2S2V1 where each training instance is comprised of the feature values for the three segments for a single breast (A) combined with the corresponding segment from the other breast (B) for the leftmost, first occurring segment of A. For example:
AtLEFT CC, AbLEFT CC, AnLEFT CC, BtLEFT CC
Where in this instance AtLEFT CC is the segment of interest.
The final experimental setup (B2S6V1) is an extension of B2S4V1 where feature values for the full breast segment for the right and left breasts are added. For example:
AtLEFT CC, AbLEFT CC, AnLEFT CC, ALEFT CC, BtLEFT CC, BLEFT CC
Where in this instance AtLEFT CC is the segment of interest.
It is important to note that where more than one segment is used, the segment of interest is the first occurring, leftmost one. If that segment has been diagnosed as cancerous then the training/test instance in which it occurs is marked as positive, and if it has not been diagnosed as cancerous then the entire instance is marked as negative regardless of the cancer status of any other segments used in that particular instance. Thus, excluding the B1S0V1 setup, the objective is not simply to determine if a given breast is positive for cancer, but rather to pinpoint which segments are positive. If successful, this capability could pave the way for further diagnosis.
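The labelling rule can be illustrated with the three-segment, single-breast case, in which the segment of interest is rotated to the front of each instance. The tuple layout and the toy values below are assumptions made for the example.

```python
def rotated_instances(segments):
    """Build one labelled instance per rotation of the segment list.

    segments: list of (features, is_cancerous) for At, Ab, An of one breast.
    The label of an instance is the status of its first segment only.
    """
    instances = []
    for k in range(len(segments)):
        rotated = segments[k:] + segments[:k]
        features = [f for feats, _ in rotated for f in feats]
        label = rotated[0][1]  # the other segments never influence the label
        instances.append((features, label))
    return instances


# Invented breast in which only the bottom segment (Ab) is cancerous.
breast = [([1.0], False), ([2.0], True), ([3.0], False)]
labelled = rotated_instances(breast)
```

Only the instance that starts with the cancerous segment is marked positive, even though the same cancerous segment also appears in the other two instances.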
4. EXPERIMENTS AND RESULTS
All experiments used a population of 200 individuals, running for 60 generations, with a crossover rate of 0.8 and mutation rate of 0.2. The minimum initial depth was 4, while the maximum depth was 17. The instruction set was small, consisting of just +, −, ∗, /. The tree terminals (leaves) are selected from the available Haralick features, with two available per segment.
To transform a continuous output of a GP tree into a
nominal decision (Positive, Negative), we binarize it using
the method described in , which optimizes the binariza-
tion threshold individually for each GP classiﬁer.
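The general idea of a per-classifier threshold can be sketched as follows. This is not the cited method: the criterion used here (maximising TPR minus FPR over candidate cut-offs) and all names are assumptions chosen purely for illustration.

```python
def best_threshold(outputs, labels):
    """Return the cut-off t maximising TPR - FPR when predicting positive for output > t.

    outputs: raw numeric outputs of one GP tree
    labels:  1 for positive, 0 for negative (both classes must be present)
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    distinct = sorted(set(outputs))
    # Candidate cut-offs: one below every output, then midpoints between neighbours.
    cuts = [distinct[0] - 1.0] + [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    def score(t):
        tp = sum(1 for o, y in zip(outputs, labels) if y and o > t)
        fp = sum(1 for o, y in zip(outputs, labels) if not y and o > t)
        return tp / positives - fp / negatives

    return max(cuts, key=score)


# Invented tree outputs for four training instances.
cut = best_threshold([-2.0, -1.0, 0.5, 3.0], [0, 0, 1, 1])
```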
For our selection and replacement strategy we employed
an NSGA-II  Multi Objective GP (MOGP) algorithm as
described in  and updated in . We chose to use the
multi-objective algorithm due to the relationship between
the main objectives for the mammography task. Prelim-
inary experiments with various composite single objective
ﬁtness functions had not proved very successful and pre-
vious work  had demonstrated the eﬀectiveness of the
The NSGA-II algorithm was used to drive selection according to performance on three different objectives. When using this type of algorithm for problems which may necessitate trade-offs due to a natural tension between objectives, the system typically does not return a single best individual at the end of evolution, but rather a Pareto front, or range of individuals representing various levels of trade-off between the different objectives. However, for this particular task, we are not interested in the Pareto front of individuals; after all, a model with a zero FPR and zero or very low TPR (every instance classified as N) is not of much practical use in this context. What we really care about is achieving the lowest possible FPR for the highest possible TPR. Thus, during evolution we maintained a single-entry “hall of fame” (HOF) for each CV iteration, whereby as we evaluated each new individual on the training data, if it had a higher TPR, or if it had an equal TPR but a lower FPR than the HOF incumbent for that CV iteration, the new individual replaced that HOF incumbent.
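The replacement rule for the single-entry hall of fame amounts to a lexicographic comparison, sketched below. Representing an individual by its (TPR, FPR) pair is a simplification made for the example.

```python
def update_hall_of_fame(incumbent, candidate):
    """Return the new hall-of-fame entry.

    Both arguments are (tpr, fpr) pairs measured on the training data;
    incumbent may be None before the first evaluation.
    """
    if incumbent is None:
        return candidate
    higher_tpr = candidate[0] > incumbent[0]
    same_tpr_lower_fpr = candidate[0] == incumbent[0] and candidate[1] < incumbent[1]
    return candidate if higher_tpr or same_tpr_lower_fpr else incumbent


# Invented sequence of evaluated individuals.
hof = None
for individual in [(0.90, 0.10), (1.00, 0.70), (1.00, 0.55), (0.95, 0.05)]:
    hof = update_hall_of_fame(hof, individual)
```

Note that the last individual is rejected despite its much lower FPR, because TPR takes strict priority.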
Using a population of 200, at each generation, 200 new offspring are generated, then parents and offspring are merged into one pool before running Pareto-based selection to select the best 200. During evolution, we aim to minimize three fitness objectives: FP Rate, 1−TP Rate and 1−AUC, where AUC is the area under the ROC curve, calculated using the Mann-Whitney test.
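The Mann-Whitney formulation of AUC is the probability that a randomly chosen positive receives a higher output than a randomly chosen negative, with ties counted as one half. A direct, quadratic-time sketch (the function name and sample scores are ours) is:

```python
def auc_mann_whitney(scores, labels):
    """AUC as the normalised Mann-Whitney U statistic.

    scores: classifier outputs; labels: 1 for positive, 0 for negative.
    Both classes must be present.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented scores: three of the four positive/negative pairs are ordered correctly.
auc = auc_mann_whitney([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```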
We performed stratified five-fold cross-validation (CV) for all setups. However, we also retained 10% of the data as a “hold out” (HO) test set, where for each set of 5 cross-validated runs this HO test set data was separated from the CV data prior to the latter’s allocation to folds for CV. The data partitioning was carried out using the scikit-learn Machine Learning (ML) toolkit. We conducted 50 cross-validated runs (each consisting of 5 runs) with identical random seeds for each configuration outlined in Table 1.
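The partitioning scheme can be sketched without external dependencies as shown below. This is a stand-in written for illustration, not the scikit-learn calls used in our experiments; the rounding of the hold-out size and the round-robin fold assignment are assumptions.

```python
import random


def stratified_partition(labels, holdout_fraction=0.1, n_folds=5, seed=0):
    """Split instance indices into a hold-out set and stratified CV folds.

    The hold-out set is removed first; the remaining indices of each class
    are then dealt round-robin into the folds so class ratios are preserved.
    """
    rng = random.Random(seed)
    holdout, folds = [], [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_holdout = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:n_holdout])
        for k, i in enumerate(idx[n_holdout:]):
            folds[k % n_folds].append(i)
    return holdout, folds


# Class sizes of the unsegmented single-view distribution: 78 positives, 510 negatives.
y = [1] * 78 + [0] * 510
ho, cv_folds = stratified_partition(y)
```

Each CV iteration would then train on four folds and test on the fifth, with the hold-out indices never seen during evolution.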
In this section we present our experimental results ﬁrstly
with regard to AUC measure on the training and test par-
titions of the CV phase, before we examine the TP and FP
rates for this data. Finally we explore the results for each
performance metric, this time taking the performance on
hold-out data into consideration.
The plots in Figure 2 show the evolution in population average AUC on the CV training data, development of the best population training AUC, change in population average AUC over the generations on the CV test data, and the evolution of the best population test AUC, where each of these represents metrics which are averaged over all cross-validated runs.
It seems that the best performing setups from the perspectives of both training and test CV data are those which leverage information from both breasts, the single breast configuration which uses all three segments or the single breast setup which uses features from the unsegmented image.
Clearly some of the worst AUC results are achieved with the configurations which use segments from a single breast, particularly that which uses two views of the same breast. This is not very surprising as the features contain essentially the same information, and having features which are strongly correlated is known to be detrimental to accurate classification.
Figure 2: Average and best population training and test AUC, averaged over all cross-validated runs.
Overall, the results suggest that increasing the number of
segments gives a signiﬁcant boost to performance in terms
of training ﬁtness but that the strategy does not necessarily
improve results on test data.
4.1.2 TP/FP Rates
Population average TP and FP rates for training data and the corresponding rates on test data can be seen in Figure 3. The plots exemplify the tension which exists in the population between the two competing objectives of maximizing the True Positive Rate (TPR) while simultaneously trying to minimize the False Positive Rate (FPR). In general, a configuration which produces a higher than average TPR will also produce a correspondingly higher FPR. For any configuration, there will always be individuals within the population which classify all instances as either negative or positive. In order to accurately distinguish which configurations are likely to deliver a usable classification model it is better to examine the results of the best performing individuals in the population on the various metrics: TP rate, FP rate and AUC. We explore this aspect in Section 4.1.3.
4.1.3 Model Selection
We report results on the training and test CV segments but the most important results are those for the HO test set, as these provide an indication of how the system might be expected to perform on new, unseen instances.
To compare with results from the literature we convert the FPR into FPPI, which we report in Table 3. Here, the results reported arise from the data shown in Table 2, which represents the mean average results for the best trained individuals. Results for each HOF are firstly averaged for each run in a CV set and then averaged across the 50 cross-validated runs. These results refer to performance on the crucial hold-out (HO) data.
Clearly the best results are produced by the two breast non-segmented approach B2S0V1 with a TPR of 1 and an FPPI of 0.33. This is closely followed by its single breast counterpart B1S0V1, which again delivered a perfect TPR and an FPPI of 0.41.
Of the segmented setups, the two augmented configurations of B2S3+0V1 and B2S4+0V1 also produced good results with perfect TPR combined with good FPPIs of 1.11 and 1.08 respectively. The B2S4V1 method also did very well with a TPR of 1 and an FPPI of 1.11.
Overall, several of our conﬁgurations proved capable of
correctly classifying 100% of the cancerous cases while at
the same time having a low FPPI, and the best results were
delivered by individuals trained to view breast asymmetry.
5. CONCLUSIONS AND FUTURE WORK
TPR and FPPI produced by the most successful experimen-
tal conﬁgurations compare well with the results reported in
section 2, and also reinforce the quality of previous results
reported in  as diﬀerent volumes have been used on this
occasion. We hypothesise that the improvement in FPPI
over the previous work is largely due to the addition of the
Haralick contrast attribute.
The experimental setup with the lowest FPPI was the
one that compared both entire breasts, showing that we
successfully leveraged textural breast asymmetry as a potential
indicator for abnormalities. Additionally, several of the
segmented configurations also produced very good results,
indicating that the system is capable of identifying
with high accuracy not only which breasts are likely to have
suspicious lesions but also which segments contain suspicious
Average and best population training and test TP and FP rates, averaged over all cross-validated runs
Method    Train TP  Train FP  Train AUC  Test TP  Test FP  Test AUC  HO TP  HO FP  HO AUC
B1S0V1    1         0.60      0.78       0.92     0.63     0.73      1      0.66   0.76
B1S1V1    1         0.62      0.74       0.94     0.65     0.69      1      0.65   0.80
B1S2V2    1         0.72      0.71       0.97     0.74     0.68      0.97   0.72   0.70
B1S3V1    1         0.48      0.82       0.93     0.50     0.77      0.96   0.51   0.76
B2S0V1    1         0.49      0.81       0.92     0.54     0.75      1      0.51   0.83
B2S2V1    1         0.55      0.77       0.96     0.58     0.74      1      0.57   0.82
B2S3+0V1  1         0.57      0.76       0.93     0.59     0.73      0.96   0.55   0.77
B2S4+0V1  1         0.54      0.76       0.92     0.57     0.71      0.96   0.52   0.78
B2S4V1    1         0.48      0.82       0.92     0.52     0.76      0.97   0.52   0.78
B2S6V1    1         0.40      0.84       0.92     0.45     0.77      0.86   0.46   0.73
Table 2: Mean training, test and hold-out (HO) TP rate, FP rate and AUC of the best trained individuals
Method    Avg TP  Avg FPPI  Best TP  Best FPPI
B1S0V1    1       0.61      1        0.41
B1S1V1    1       1.88      1        1.68
B1S2V2    0.97    2.03      0.95     1.86
B1S3V1    0.96    1.49      0.80     1.08
B2S0V1    1       0.45      1        0.33
B2S2V1    1       1.67      1        1.34
B2S3+0V1  0.96    1.57      1        1.11
B2S4+0V1  0.96    1.48      1        1.08
B2S4V1    0.97    1.52      1        1.11
B2S6V1    0.86    1.28      0.77     1.06
Table 3: Mean TP and FPPI of the best trained individuals, and
TP and FPPI of the single best trained individual, on HO data.
Best trained individuals are selected according to the algorithm
described in Section 4.
areas. The ﬁrst of these capabilities could prove useful in
providing second reader functionality to busy radiologists,
whereas the second may provide inputs into an automated
Future work will focus on further refining abnormality
detection so that the specific location of suspicious areas
within segments may be identified. We are also exploring
the possibility of developing ensemble classifiers in which each
member may have been trained on images from a different type of X-ray
machine. This may be possible because digital mammograms are
stored in a format whose metadata includes details
of the specific machine and location where the mammogram
K. Krawiec acknowledges support from the Ministry of Science
and Higher Education grant 09/91/DSPB/0572. The remaining
authors gratefully acknowledge the support of Science Foundation
Ireland, grant number 10/IN.1/I3031.
 Arbab Masood Ahmad, Gul Muhammad Khan,
Sahibzada Ali Mahmud, and Julian Francis Miller. Breast
cancer detection using cartesian genetic programming
evolved artiﬁcial neural networks. In Terry Soule et al.,
editors, GECCO ’12: Proceedings of the fourteenth
international conference on Genetic and evolutionary
computation conference, pages 1031–1038, Philadelphia,
Pennsylvania, USA, 7-11 July 2012. ACM.
 Leonard Berlin. Liability of interpreting too many
radiographs. American Journal of Roentgenology,
 Mythreyi Bhargavan, Jonathan H. Sunshine, and Barbara
Schepps. Too few radiologists? American Journal of
Roentgenology, 178:1075–1082, 2002.
 Keir Bovis and Sameer Singh. Detection of masses in
mammograms using texture features. In Pattern
Recognition, 2000. Proceedings. 15th International
Conference on, volume 2, pages 267–270. IEEE, 2000.
 Tommy E. Cupples, Joan E. Cunningham, and James C.
Reynolds. Impact of computer-aided detection in a regional
screening mammography program. American Journal of
Roentgenology, 186:944–950, 2005.
 Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and
TAMT Meyarivan. A fast and elitist multiobjective genetic
algorithm: NSGA-II. Evolutionary Computation, IEEE
Transactions on, 6(2):182–197, 2002.
 Jeannie Fitzgerald and Conor Ryan. Exploring boundaries:
optimising individual class boundaries for binary
classiﬁcation problems. In Proceedings of the 14th
international conference on Genetic and evolutionary
computation conference, GECCO ’12, pages 743–750, New
York, NY, USA, 2012. ACM.
 Jeannie Fitzgerald and Conor Ryan. A hybrid approach to
the problem of class imbalance. In International Conference
on Soft Computing, Brno, Czech Republic, June 2013.
 Félix-Antoine Fortin and Marc Parizeau. Revisiting the
NSGA-II crowding-distance computation. In Proceedings of
the 15th Annual Conference on Genetic and Evolutionary
Computation, GECCO ’13, pages 623–630, New York, NY,
USA, 2013. ACM.
 Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H. Witten. The
WEKA data mining software: an update. SIGKDD Explor.
Newsl., 11(1):10–18, November 2009.
 R. Haralick et al. Texture features for image classification.
IEEE Transactions on Systems, Man, and Cybernetics,
 Trevor Hastie, Robert Tibshirani and Jerome Friedman.
The elements of statistical learning, volume 2. Springer,
 Michael Heath, Kevin Bowyer, Daniel Kopans, Richard
Moore, and W. Philip Kegelmeyer. The digital database for
screening mammography. In M.J. Yaﬀe, editor, Proceedings
of the Fifth International Workshop on Digital
Mammography, pages 212–218. Medical Physics Publishing,
 Rolando R Hernández-Cisneros, Hugo Terashima-Marín,
and Santiago E Conant-Pablos. Comparison of class
separability, forward sequential search and genetic
algorithms for feature selection in the classiﬁcation of
individual and clustered microcalciﬁcations in digital
mammograms. In Image Analysis and Recognition, pages
911–922. Springer, 2007.
 J. Koza. Genetic programming: A paradigm for genetically
breeding populations of computer programs to solve
problems. Technical Report STAN-CS-90-1314, Dept. of
Computer Science, Stanford University, June 1990.
 Solomon Kullback and Richard A Leibler. On information
and suﬃciency. The Annals of Mathematical Statistics,
pages 79–86, 1951.
 W.B. Langdon and A.P. Harrison. GP on SPMD parallel
graphics hardware for mega bioinformatics data mining.
Soft Computing, 12(12):1169–1183, 2008.
 Huai Li, Yue Wang, KJ Ray Liu, S-CB Lo, and Matthew T
Freedman. Computerized radiographic mass detection. I.
lesion site selection by morphological enhancement and
contextual segmentation. Medical Imaging, IEEE
Transactions on, 20(4):289–301, 2001.
 Simone A. Ludwig and Stefanie Roos. Prognosis of breast
cancer using genetic programming. In Rossitza Setchi et al.,
editors, 14th International Conference on Knowledge-Based
and Intelligent Information and Engineering Systems
(KES 2010), Part IV, volume 6279 of LNCS, pages
536–545, Cardiﬀ, UK, September 8-10 2010. Springer.
 M. Sampat, M. Markey, and A. Bovik. Computer-aided
detection and diagnosis in mammography. In Alan C.
Bovik, editor, Handbook of Image and Video Processing.
Elsevier Academic Press, 2010.
 MATLAB. version 8.2 (R2012a). MathWorks Inc., Natick,
 Valerie A McCormack and Isabel dos Santos Silva. Breast
density and parenchymal patterns as markers of breast
cancer risk: a meta-analysis. Cancer Epidemiology
Biomarkers & Prevention, 15(6):1159–1169, 2006.
 Naga R Mudigonda, Rangaraj M Rangayyan, and JE Leo
Desautels. Gradient and texture analysis for the
classiﬁcation of mammographic masses. Medical Imaging,
IEEE Transactions on, 19(10):1032–1043, 2000.
 R. J. Nandi, A. K. Nandi, R. Rangayyan, and D. Scutt.
Genetic programming and feature selection for classiﬁcation
of breast masses in mammograms. In 28th Annual
International Conference of the IEEE Engineering in
Medicine and Biology Society, EMBS ’06, pages
3021–3024, New York, USA, August 2006. IEEE.
 F. Pedregosa et al. Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research,
 Nicholas Petrick, Berkman Sahiner, Samuel G Armato III,
Alberto Bert, Loredana Correale, Silvia Delsanto,
Matthew T Freedman, David Fryd, David Gur, Lubomir
Hadjiiski, et al. Evaluation of computer-aided detection and
diagnosis systems. Medical Physics, 40(8):087001, 2013.
 W. E. Polakowski, D. A. Cournoyer, and S. K. Rogers.
Computer-aided breast cancer detection and diagnosis of
masses using diﬀerence of gaussians and derivative-based
feature saliency. IEEE Trans. Med. Imag., 16:811–819,
 Rangaraj M Rangayyan, Nema M El-Faramawy, JE Leo
Desautels, and Onsy Abdel Alim. Measures of acutance
and shape for classiﬁcation of breast tumors. Medical
Imaging, IEEE Transactions on, 16(6):799–810, 1997.
 Conor Ryan, Krzysztof Krawiec, Una-May O’Reilly,
Jeannie Fitzgerald, and David Medernach. Building a stage
1 computer aided detector for breast cancer using genetic
programming. In M. Nicolau et al., editors, 17th European
Conference on Genetic Programming, volume 8599 of
LNCS, pages 162–173, Granada, Spain, 23-25 April 2014.
 Robert A Smith, Stephen W Duffy, and László Tabár.
Breast cancer screening: the evolving evidence. Oncology,
 Paul Stober and Shi-Tao Yeh. An explicit functional form
specification approach to estimate the area under a receiver
operating characteristic (ROC) curve. Available at
http://www2.sas.com/proceedings/sugi27/p226–227.pdf,
accessed March 7, 2007.
 L. Tabar et al. A new era in the diagnosis of breast cancer.
Surgical oncology clinics of North America, 9(2):233–77,
 T. Tot, L. Tabar, and P. B. Dean. The pressing need for
better histologic-mammographic correlation of the many
variations in normal breast anatomy. Virchows Archiv,
437(4):338–344, October 2000.