Discovering Telecom Fraud Situations through Mining
Anomalous Behavior Patterns
Ronnie Alves, Pedro Ferreira,
Orlando Belo, Joao Lopes, Joel
Ribeiro
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
{ronnie, pedrogabriel,
obelo}@di.uminho.pt
Luís Cortesão
Portugal Telecom Inovação, SA,
Rua Eng. José Ferreira Pinto Basto
3810-106 Aveiro
PORTUGAL
lcorte@ptinovacao.pt
Filipe Martins
Telbit, Lda,
Rua Banda da Amizade, 38
3810-059 Aveiro
PORTUGAL
fmartins@telbit.pt
ABSTRACT
In this paper we tackle the problem of superimposed fraud
detection in telecommunication systems. We propose two
anomaly detection methods based on the concept of signatures.
The first method relies on a signature deviation-based approach,
while the second relies on dynamic clustering analysis. Experiments
carried out with real data, voice call records from an entire week
corresponding to approximately 2.5 million CDRs and 700
thousand signatures processed per day, allowed us to detect
several anomalous situations. The fraud analysts provided us with a
small list of 12 customers for whom fraudulent behavior was
detected during this week. The two methods discovered 9 and 11 of
these fraud situations, respectively. Preliminary results
and discussions with fraud analysts have already shown that our
methods are a valuable tool to assist them in fraud detection.
1. INTRODUCTION
In superimposed fraud situations, fraudsters make illegitimate
use of a legitimate account by different means. In this case, the
abnormal usage is blurred into the characteristic usage of the
account. This type of fraud is usually more difficult to detect and
poses a bigger challenge to telecommunications companies. Since
the 1990s, telecommunications companies have used several kinds
of approaches based on statistical analysis and heuristic methods
to assist them in the detection and categorization of fraud
situations. More recently, they have been adopting data mining and
knowledge discovery techniques for this task. In this paper we
tackle the problem of superimposed fraud detection in
telecommunication systems. Two methods for discovering fraud
situations through mining anomalous customer behavior patterns
are presented. These methods are based on the concept of
signature [3], which has already been used successfully for
anomaly detection in many areas, such as credit card usage [1],
network intrusion [2] and, in particular, telecommunications
fraud [3]. Our goal was to detect deviating behaviors in a timely
manner, giving analysts a better basis for accurate decisions when
establishing potential fraud situations.
2. THE ROLE OF SIGNATURES ON
DETECTING FRAUD
Our technique has the notion of signature as its core concept.
We emphasize the work of Cortes and Pregibon [3], since it was
the main inspiration for the use of signatures, although we have
redefined their notion of signature. A signature of a user
corresponds to a vector of feature variables whose values are
determined during a certain period of time. The variables can be
simple, if they consist of a single atomic value (e.g. an integer or
a real), or complex, if they consist of two co-dependent statistical
values, typically the average and the standard deviation of a given
feature.
Table 1. Description of the fv used in signature and summary.
Description                               Type
Duration of Calls                         Complex
N. of Calls – Working Days                Complex
N. of Calls – Weekends and Holidays       Complex
N. of Calls – Working Time (8h-20h)       Complex
N. of Calls – Night Time (20h-8h)         Complex
N. of Calls to Diff. National Networks    Simple
N. of Calls as Caller (Origin)            Simple
N. of Calls as Called (Destination)       Simple
N. of International Calls                 Simple
N. of Calls as Caller in Roaming          Simple
N. of Calls as Called in Roaming          Simple
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
DMBA'06, August 20, 2006, Philadelphia, Pennsylvania, USA.
Copyright 2006 ACM 1-59593-439-1...$5.00
The choice of the type of each variable depends on several factors,
such as the complexity of the feature described or the data
available to perform the calculation. A feature like the duration of
calls shows significant variability, which is much better
expressed through an average (µ) / standard deviation (σ)
pair. A feature like the number of international calls is
typically much less frequent, and thus an average value is
sufficient to describe it. Table 1 lists the complete set of
feature variables (fv) used in the context of this work. A signature
S is then obtained from a function φ for a given temporal window
ω, where S = φ(ω). We define a time unit as the amount of time
during which the CDRs are accumulated; at the end of this period
they are processed. A summary C has the same information
structure as a signature, but it is used to summarize the user
behavior over a smaller time period. Typically, a signature reflects
the usage patterns for a period of a week, a month or even half a
year, whereas a summary reflects a period of an hour, half a day or
a complete day. In this work, we considered a period of one day for
a summary and one week for the signature.
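As an illustration of the difference between simple and complex feature variables, the following minimal sketch aggregates one day of calls into a summary. The `CallRecord` type, its field names and the `build_summary` helper are hypothetical and cover only two of the features in Table 1; the use of the population standard deviation is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple, Union

# A complex feature variable is an (average, standard deviation) pair;
# a simple feature variable is a single number.
Complex = Tuple[float, float]
FeatureValue = Union[float, Complex]


@dataclass
class CallRecord:
    """Hypothetical, simplified CDR holding only the fields this sketch needs."""
    duration_s: float
    international: bool


def build_summary(calls: List[CallRecord]) -> Dict[str, FeatureValue]:
    """Aggregate one time unit of CDRs (here: one day) into a summary."""
    durations = [c.duration_s for c in calls]
    if durations:
        # Assumption: population standard deviation over the day's calls.
        duration_fv: Complex = (mean(durations), pstdev(durations))
    else:
        duration_fv = (0.0, 0.0)
    return {
        "duration": duration_fv,  # complex: (mu, sigma)
        "international_calls": float(sum(c.international for c in calls)),  # simple
    }


day = [CallRecord(60, False), CallRecord(120, False), CallRecord(180, True)]
summary = build_summary(day)
```

A weekly signature would have the same structure, which is what makes a summary directly comparable with a signature.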
3. DEVIATING PATTERNS
3.1 Evaluating Similarities among Signatures
3.1.1 Similarity of Simple Feature Variables
A simple feature is defined by a single variable, which
corresponds to the average value of the considered feature. To
compare simple feature variables we use a ratio-scaled function.
This type of function makes a positive measurement on a
non-linear scale, which in this case is the exponential scale. The
function takes values in the range [0, 1] and is defined by the
equation
$$d(S_x, S_y) = e^{-\left\{\frac{|S_x - S_y| \times B}{Amp}\right\}} \tag{1}$$
In equation 1, $S_x$ and $S_y$ are the two variables under
comparison, B is a constant value, and Amp is the amplitude
(difference between the maximum and minimum value) of the
respective feature variable over the whole signature space.
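A minimal sketch of this similarity function is given below, assuming it reads d(S_x, S_y) = exp(-|S_x - S_y| × B / Amp). The function name, parameter names and the example values of B and Amp are illustrative only.

```python
import math


def simple_similarity(s_x: float, s_y: float, b: float, amp: float) -> float:
    """Exponential similarity of two simple feature variables, in (0, 1].

    b is the constant B and amp the amplitude (max - min) of the feature
    over all signatures; both are supplied by the caller.
    """
    if amp <= 0:
        raise ValueError("amp must be positive")
    return math.exp(-(abs(s_x - s_y) * b) / amp)


same = simple_similarity(5.0, 5.0, b=1.0, amp=10.0)       # identical values
far_apart = simple_similarity(0.0, 10.0, b=1.0, amp=10.0)  # full amplitude apart
```

Identical values give a similarity of 1, and the similarity decays exponentially as the difference grows relative to the amplitude; a larger B makes the decay steeper.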
3.1.2 Similarity of Complex Feature Variable
Complex feature variables are defined by two co-dependent
variables, which correspond respectively to the average and the
standard deviation of the considered feature. For two complex
variables, $C_x = (M_x, \sigma_x)$ and $C_y = (M_y, \sigma_y)$, the
similarity function is defined in equation 2 and lies within the
range [0, 1].
$$d(C_x, C_y) = d(M_x, M_y) \times \frac{|C_x \cap C_y|}{|C_x \cup C_y|} \tag{2}$$
Equation 2 combines two formulas: the similarity function for
simple variables (eq. 1) and the ratio
$|C_x \cap C_y| / |C_x \cup C_y|$. This ratio is also within the
range [0, 1] and provides the degree of overlap of the two complex
feature variables by measuring the intersection of the intervals
$[M_x - \sigma_x, M_x + \sigma_x]$ and
$[M_y - \sigma_y, M_y + \sigma_y]$ relative to their union.
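The overlap ratio and the complex similarity can be sketched as follows, assuming the similarity of two complex variables is the simple similarity of their averages multiplied by the interval overlap ratio. The handling of the degenerate case in which both standard deviations are zero is our own assumption, since the text does not cover it; all names are illustrative.

```python
import math
from typing import Tuple

Complex = Tuple[float, float]  # (average M, standard deviation sigma)


def simple_similarity(s_x: float, s_y: float, b: float, amp: float) -> float:
    """Exponential similarity of two averages (simple-variable case)."""
    return math.exp(-(abs(s_x - s_y) * b) / amp)


def overlap_ratio(c_x: Complex, c_y: Complex) -> float:
    """|Cx ∩ Cy| / |Cx ∪ Cy| for the intervals [M - sigma, M + sigma]."""
    lo_x, hi_x = c_x[0] - c_x[1], c_x[0] + c_x[1]
    lo_y, hi_y = c_y[0] - c_y[1], c_y[0] + c_y[1]
    inter = max(0.0, min(hi_x, hi_y) - max(lo_x, lo_y))
    union = (hi_x - lo_x) + (hi_y - lo_y) - inter
    if union == 0.0:
        # Assumption: two zero-width intervals overlap fully only if equal.
        return 1.0 if lo_x == lo_y else 0.0
    return inter / union


def complex_similarity(c_x: Complex, c_y: Complex, b: float, amp: float) -> float:
    """Similarity of the averages scaled by the interval overlap."""
    return simple_similarity(c_x[0], c_y[0], b, amp) * overlap_ratio(c_x, c_y)


partial = overlap_ratio((10.0, 2.0), (12.0, 2.0))  # [8, 12] vs [10, 14]
```

In the example, the intervals [8, 12] and [10, 14] share a length of 2 out of a union of length 6, so the overlap ratio is one third; disjoint intervals give a ratio, and therefore a similarity, of 0.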
3.2 Calculating the Distance among
Signatures
Since the feature variables in the signature have different types,
each variable has to be evaluated by a distinct sub-function. Thus,
the dist function is composed of several sub-functions:
dist = θ($f_1, f_2, \ldots, f_n$). Consider as an example the
simplified signature
$S = \{(\mu_a, \sigma_a); \mu_b; \mu_c; (\mu_d, \sigma_d)\}$, where
the first and the last feature variables are complex (evaluated
with Eq. 2) and the second and the third are simple (evaluated with
Eq. 1). Let
$C = \{(\mu'_a, \sigma'_a); \mu'_b; \mu'_c; (\mu'_d, \sigma'_d)\}$
be a summary. We are interested in deviation detection from a
probabilistic point of view, i.e. the distance measure between a
signature S and a summary C corresponds to the probability of C
being different from S. The proposed distance function can be
presented as:
$$D(S, C) = \alpha_1 \cdot f_1(S_1, C_1)^2 + \ldots + \alpha_n \cdot f_n(S_n, C_n)^2 \tag{3}$$
The fraud analyst can provide different distance functions by
setting the weighting factors $\alpha_i$ to different values. Using
different distance functions allows deviations to be detected in
different scenarios. The overall distance function can then be
redefined as in equation 4.
$$Dist(S, C) = \max\{dist_1(S, C), dist_2(S, C), \ldots, dist_m(S, C)\} \tag{4}$$
If the distance exceeds a threshold value ε defined by the analyst,
i.e. Dist(S, C) > ε, then an alarm is raised for later examination
of the respective user. Otherwise, the user is considered to be
within their normal behavior.
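The alarm rule can be sketched as below. The sketch assumes that each distance function is a weighted sum of squared per-feature terms f_i, that the per-feature terms have already been computed, and that each analyst-defined distance function is simply a different weight vector; the function names and the example numbers are illustrative and not taken from the evaluation.

```python
from typing import Sequence


def weighted_distance(terms: Sequence[float], alphas: Sequence[float]) -> float:
    """One distance function: sum of alpha_i * f_i^2 over the feature terms."""
    if len(terms) != len(alphas):
        raise ValueError("one weight per feature term is required")
    return sum(a * f * f for a, f in zip(alphas, terms))


def overall_distance(terms: Sequence[float],
                     weight_sets: Sequence[Sequence[float]]) -> float:
    """Maximum over the m analyst-defined distance functions."""
    return max(weighted_distance(terms, alphas) for alphas in weight_sets)


def raises_alarm(terms: Sequence[float],
                 weight_sets: Sequence[Sequence[float]],
                 epsilon: float) -> bool:
    """An alarm is raised when the overall distance exceeds the threshold."""
    return overall_distance(terms, weight_sets) > epsilon


# Hypothetical per-feature terms for one user and two weightings:
# the first ignores the third feature, the second looks only at it.
terms = [0.5, 1.0, 2.0]
weight_sets = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
dist = overall_distance(terms, weight_sets)
```

Taking the maximum means that a user is flagged as soon as any one of the scenarios encoded by a weight vector shows a deviation above ε.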
3.3 Anomaly Detection Procedure
The anomaly detection procedure based on signature deviation
consists of several steps. It starts with a loading step, which
imports the signature and summary information of each user into
the local database of the system. The signatures are imported only
once, when the system is started. All the signatures of a user are
kept through time, since this information is also useful for later
analysis. A signature may have one of two statuses, "Active" or
"Expired". For each user only one signature can have the Active
status, and it is the most up-to-date one. The processing step is
described by the algorithm presented in [5], and follows the
previous equations for calculating the distance and similarities
among signatures. According to equation (4), if an alarm is raised,
the user is put on a blacklist. This is performed in the alarm
triggering step, which is based on the calculation of all the
distance functions over the signatures. At the end, every raised
alarm has to be verified by the analyst in order to determine
whether or not it corresponds to a fraud situation. The evaluation
of the alarms is supported by the system interface, which employs
features of dashboard systems and provides a complete set of
valuable information [5].
3.4 Signature Updating
The updating process of the signatures follows the ideas presented
in [3]. The update of a signature $S_t$ at instant t+1, $S_{t+1}$,
through a set of processed CDRs (summary) C, is given by the
formula:
$$S_{t+1} = \beta \cdot S_t + (1 - \beta) \cdot C \tag{5}$$
The constant β determines the weight of the new actions C in the
values of the new signature: the previous signature is weighted by
β and the summary by 1-β. This constant can be adjusted depending
on the size of the time window ω [3]. In contrast to the system in
[3], the signature is always updated. If
$Dist(S_t, C) \le \varepsilon$, the user is considered to have a
normal behavior. If $Dist(S_t, C) > \varepsilon$, an alarm is
triggered; nevertheless, the signature continues to be updated. The
reason for this is that the alarm still needs to be analyzed by the
company's fraud analyst, who may consider it a false alarm. The
continuous update of the user's signature avoids the loss of
information gathered between the moment when the alarm was
triggered and the moment the analyst gives a verdict.
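The update rule S_{t+1} = β·S_t + (1-β)·C can be sketched as follows. The sketch assumes that a signature is flattened into a list of numbers and that every component, including the two parts of a complex variable, is updated independently; the value β = 0.75 is purely illustrative.

```python
from typing import List


def update_signature(signature: List[float], summary: List[float],
                     beta: float) -> List[float]:
    """Exponentially weighted update: beta * old signature + (1 - beta) * summary."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(signature) != len(summary):
        raise ValueError("signature and summary must have the same structure")
    return [beta * s + (1.0 - beta) * c for s, c in zip(signature, summary)]


# Flattened (average, standard deviation) of call duration, illustrative values.
old_signature = [10.0, 2.0]
daily_summary = [20.0, 4.0]
new_signature = update_signature(old_signature, daily_summary, beta=0.75)
```

With β close to 1 the signature changes slowly and a single unusual day has little effect, whereas a smaller β lets recent behavior dominate.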
4. CHANGING PATTERNS
4.1 Clustering Signatures
The analysis of changes in the cluster topology over a period of
time provides valuable information for a better understanding of
the usage patterns of telecommunications services. In particular,
the detection of abrupt changes in cluster membership may provide
strong evidence of a fraud situation. We propose the application
of dynamic clustering analysis techniques over signature data. Our
aim is that these changes will also provide evidence to fraud
analysts for establishing potential fraud situations.
4.1.1 Similarity of Signatures
Signatures are composed of simple and complex variables.
Traditional similarity measures, such as Euclidean distance,
Pearson correlation or the Jaccard measure, are not directly
applicable to signature comparison. Therefore, we need to devise a
new similarity measure that allows us to determine similarities
among signatures. We define the similarity between two signatures
as the combination of the variable similarity measures defined in
section 3.1. For two signatures X and Y, where $X_i$ and $Y_i$ are
respectively the feature variable i of X and Y, and for n possible
variables, the similarity measure is defined as in equation 6.
$$D(X, Y) = W_1 \cdot d(X_1, Y_1)^2 + \ldots + W_n \cdot d(X_n, Y_n)^2 \tag{6}$$
$D(X, Y) \in [0, 1]$, $W_i$ defines the weight of feature i, and
$\sum_{i=1}^{n} W_i = 1$. With this signature similarity measure,
we can compare all signatures. This provides an N x N matrix that
summarizes the similarities among the N signatures. The clustering
solution can then be obtained by taking the previously calculated
matrix as input.
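A minimal sketch of the N x N matrix construction is shown below. It assumes that the signature similarity is a weighted sum of squared per-feature similarities with weights summing to 1, that the per-feature similarity is symmetric, and, for brevity, that all features are simple; `exp_sim` and the amplitudes are illustrative stand-ins.

```python
import math
from typing import Callable, List, Sequence

# feature_sim(i, a, b) returns the similarity in [0, 1] of feature i.
FeatureSim = Callable[[int, float, float], float]


def signature_similarity(x: Sequence[float], y: Sequence[float],
                         weights: Sequence[float],
                         feature_sim: FeatureSim) -> float:
    """Weighted sum of squared per-feature similarities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights W_i must sum to 1")
    return sum(w * feature_sim(i, xi, yi) ** 2
               for i, (w, xi, yi) in enumerate(zip(weights, x, y)))


def similarity_matrix(signatures: Sequence[Sequence[float]],
                      weights: Sequence[float],
                      feature_sim: FeatureSim) -> List[List[float]]:
    """Symmetric N x N matrix of pairwise signature similarities."""
    n = len(signatures)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # fill the upper triangle and mirror it
            value = signature_similarity(signatures[i], signatures[j],
                                         weights, feature_sim)
            matrix[i][j] = matrix[j][i] = value
    return matrix


amplitudes = [100.0, 10.0]  # illustrative amplitude of each feature


def exp_sim(i: int, a: float, b: float) -> float:
    """Stand-in per-feature similarity with B = 1."""
    return math.exp(-abs(a - b) / amplitudes[i])


signatures = [[50.0, 2.0], [50.0, 2.0], [90.0, 9.0]]
matrix = similarity_matrix(signatures, [0.5, 0.5], exp_sim)
```

Because the matrix grows quadratically with the number of signatures, computing it over all signatures at once is only practical for a sample or a partition of the data.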
4.2 Clustering Migration Analysis
According to the moment of the week, different usage patterns
can be found [6]. These usage profiles are obtained by means of
signature clustering analysis, according to the method described
previously in section 4.1. Therefore, a cluster topology is
provided for each day of the week. This topology describes the
customers' usage patterns during that period. Each cluster is
described by the characteristics of its centroid, which is itself
defined as a signature. This allows direct comparisons between
signatures and cluster centroids, made according to formula 6. A
signature is assigned to a cluster by comparing it against each
cluster centroid and assigning it to the cluster to which it has
the smallest distance.
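The assignment step can be sketched as a nearest-centroid search. The comparison function is passed in as a parameter; the `manhattan` function used in the example is only a stand-in for the signature comparison measure, and the centroids are made-up values.

```python
from typing import Callable, Sequence

Signature = Sequence[float]


def assign_to_cluster(signature: Signature,
                      centroids: Sequence[Signature],
                      distance: Callable[[Signature, Signature], float]) -> int:
    """Return the index of the centroid with the smallest distance."""
    if not centroids:
        raise ValueError("at least one centroid is required")
    return min(range(len(centroids)),
               key=lambda k: distance(signature, centroids[k]))


def manhattan(a: Signature, b: Signature) -> float:
    """Stand-in distance, used here only to make the example runnable."""
    return sum(abs(p - q) for p, q in zip(a, b))


centroids = [[10.0, 1.0], [200.0, 30.0]]  # e.g. low-usage and high-usage profiles
cluster = assign_to_cluster([180.0, 25.0], centroids, manhattan)
```

Since a centroid has the same structure as a signature, the same comparison function serves both signature-to-signature and signature-to-centroid comparisons.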
4.2.1 Absolute and Relative Similarity
In order to compare signatures against cluster centroids, two
types of similarity measures can be defined: absolute and relative
similarity. Absolute similarity is the similarity value between the
signature and the centroid at a given time instant t. This value is
calculated according to formula 6. Relative similarity relates the
absolute similarity at instants t and t+1, providing the percentage
of signature variation between two consecutive time instants. This
value is obtained through the formula:
$$\Delta = \left\{1 - \frac{D(S_i, SignCl[S_i])^{t}}{D(S_i, SignCl[S_i])^{t+1}}\right\} \times 100\% \tag{7}$$
In formula 7, $S_i$ corresponds to a signature, and $SignCl[S_i]$
to the cluster to which $S_i$ belongs at instant t. Figure 1 shows a
positive variation, where the signature $S_i$ is closer to the
centroid 0 at instant t than at t+1.
Figure 1. Positive variation of the relative similarity of the
signature.
Figure 2. Negative variation of relative similarity of the
signature and change cluster membership.
A negative value of the relative similarity at instant t+1
indicates that the signature $S_i$ is now closer to the centroid of
the cluster to which it belonged at instant t. Nevertheless, we can
detect a cluster membership change, since $S_i$ is now closer to
another cluster (cluster 1) (figure 2).
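The sign convention of the relative similarity can be checked with a small sketch. It assumes the variation is computed as (1 - D_t / D_{t+1}) × 100%, where D_t and D_{t+1} are the distances of the signature to the centroid of its cluster at instants t and t+1; this reading is the one consistent with the positive and negative cases described here. The function name and the numbers are illustrative.

```python
def relative_similarity(d_t: float, d_t1: float) -> float:
    """Percentage variation between the distances at instants t and t+1."""
    if d_t1 == 0.0:
        raise ValueError("distance at t+1 must be non-zero")
    return (1.0 - d_t / d_t1) * 100.0


moving_away = relative_similarity(d_t=0.2, d_t1=0.4)    # farther from the centroid at t+1
moving_closer = relative_similarity(d_t=0.4, d_t1=0.2)  # closer to the centroid at t+1
```

A signature that doubles its distance to the centroid yields a positive variation, while one that halves its distance yields a negative variation.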
We define a cluster membership change as follows: a signature S
changes its cluster membership to cluster $C_j$ at instant t+1 if
it belongs to cluster $C_i$ at instant t, the distance $D(S, C_j)$
at instant t+1 is minimal over all clusters, and
$D(S, C_j)^{t+1} < D(S, C_i)^{t}$. All the data about the cluster
membership of the signatures are kept for later analysis. These
data, which we call historical data, make it possible to assess the
evolution of customer behavior through time. In order to offer the
analyst a tool for better examination of the changing behavior of
customers during a defined interval, analysis reports can be
generated [6]. This tool identifies all the conditions used, as
well as the average and standard deviation of the signature
variations and the maximum, minimum and average values for all the
signature feature variables. The deviating signatures detected are
included in a blacklist for further analysis.
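The membership change rule can be sketched as follows, assuming the distances of a signature to every cluster centroid are available for instants t and t+1. The function name, the list-based representation and the requirement that the new cluster differ from the old one are our own reading of the definition.

```python
from typing import Optional, Sequence


def membership_change(dist_t: Sequence[float], dist_t1: Sequence[float],
                      cluster_t: int) -> Optional[int]:
    """Return the new cluster index j if membership changed, else None.

    dist_t[k] and dist_t1[k] are the distances of the signature to the
    centroid of cluster k at instants t and t+1; cluster_t is the cluster
    the signature belongs to at instant t.
    """
    nearest_t1 = min(range(len(dist_t1)), key=lambda k: dist_t1[k])
    if nearest_t1 != cluster_t and dist_t1[nearest_t1] < dist_t[cluster_t]:
        return nearest_t1
    return None


# The signature was in cluster 0 and is now much closer to cluster 1.
changed = membership_change([0.2, 0.9], [0.6, 0.1], cluster_t=0)
```

Storing the returned index for every signature and day gives the historical data from which the analysis reports are generated.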
Figure 3. Example of fraud situations in the blacklist.
Figure 3 shows an example of a real fraud situation detected by
our methods in the evaluation study. The first line contains a
header with the temporal reference, the analysis report
description, and the limits of the range [µ-2σ, µ+2σ]; any
variation outside these limits is considered an abnormal
situation. The following lines list the moment when the anomaly
was detected, the signature identification (phone number), the
cluster to which the signature belongs, a flag indicating a cluster
membership change (1 in the positive case), the absolute similarity
between the signature and the cluster, the relative similarity
(variation) and, as the last column, the description of the
respective analysis report. More detailed information about the
methods, as well as scalability issues regarding their application,
can be found in [5, 6].
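The [µ-2σ, µ+2σ] rule used in the report header can be sketched as below, assuming µ and σ are the average and standard deviation of the relative variations of all signatures in the analyzed interval. The use of the population standard deviation, the function name and the user identifiers are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import Dict


def abnormal_variations(variations: Dict[str, float]) -> Dict[str, float]:
    """Keep the signatures whose variation falls outside [mu - 2*sigma, mu + 2*sigma]."""
    values = list(variations.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    low, high = mu - 2.0 * sigma, mu + 2.0 * sigma
    return {sid: v for sid, v in variations.items() if v < low or v > high}


# Nine users with no variation and one with a 100% variation (made-up data).
variations = {f"u{i}": 0.0 for i in range(9)}
variations["u9"] = 100.0
blacklist = abnormal_variations(variations)
```

In this example µ = 10 and σ = 30, so the accepted range is [-50, 70] and only the user with a variation of 100 is blacklisted.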
5. EVALUATING REAL FRAUD
SITUATIONS
In order to assess the quality of our methodology in detecting
anomalous behaviors, we examined the data corresponding to a
week of voice calls from a Portuguese mobile telecommunications
network. The complete set of CDRs corresponds to approximately
2.5 million records and 700 thousand signatures processed per
day. Up to now, no accurate database of previous fraud cases
exists. Thus, the settings of our methods were guided by a small
list of 12 customers (fraudsters in the referenced week), provided
by the fraud analysts in order to detect other similar behaviors. In
this first stage of detecting anomalous situations, we are
interested in the effectiveness of our methods. Therefore, we
worked on a subset of the previous data, a sample with
approximately 5 thousand summaries per day and their respective
signatures for the whole week. The detection process was carried
out by applying the method described in section 3. Several
thresholds (ε) were used, and four main distance functions were
designed combining different feature variables and weights. An
illustration of the alarms generated by the deviation-based
approach is given in table 2. Note the rightmost (gray) column:
further investigation of those alarms showed that some of them
were real fraud situations.
Table 2. Different thresholds (ε) and the alarms generated for
three particular days of the week.
Day / ε     0.8     1.0     1.2     1.6     2.0
Tue        2141     649     139      50      25
Wed        3029    1145     251     103      56
Sat        1006     560     150      39      23
To better understand the circumstances in which those alarms
were generated, one must investigate the impact of each variable
on the MAX distance function (Eq. 4). In figure 4 we exemplify
such an evaluation by allowing top-k queries over the complete set
of alarms. We also verified that the 10 most significant anomalous
situations had distance values ranging from 2.76 to 3.33. The
feature variable with the greatest impact on the distance
calculation is the number of international calls (originated
ones). On the other hand, in Figure 5 we can see that the
working-hours variable has great importance for the distance
calculation over the whole period.
Figure 4. The impact of each feature variable over the top-10
highest alarms.
Figure 5. An overall picture of feature variable distribution
over the max distance (ε ≥ 2).
It is important to mention that both methods provide only insights
that could be recognized as anomalous situations. In fact, the
characteristics of the data provided by the analysts do not allow
us to apply any classification technique. Therefore, it is quite
hard to evaluate precisely the rates of false positives and false
negatives. Nevertheless, given the small list provided by the
analysts, we can report a recall of 75%.
In order to complement the previous results, we further made use
of a dynamic clustering approach to detect suspicious changes in
cluster membership over the whole week. The identification of
those changes triggers alarms for future inspection. After
several executions, the quality of the clusters was maximized
with 8 clusters. The distribution of the alarms raised by this
method is shown in table 3.
Table 3. Alarms raised per cluster for three particular days of
the week.
Cluster    Tue    Wed    Sat
1            3      9      1
2            9      7    123
3            3     12     71
4            5     17     16
5           23     21     22
6           20     31     40
7            8     11     26
8           52     72      0
The bottom (gray) line in table 3 shows the cluster with the
highest number of calls. Figure 6 shows an example of a change in
cluster membership, which represents a real fraud situation
identified by this method. The first and second customers pass
from clusters with a lower average number of calls (1 and 2) to
the cluster with the highest number of calls (8), on days 4 and 3
respectively. The third customer in this example, although always
in the same cluster, registered a significant variation between
days 5 and 6.
Figure 6. Example of anomalous situations related to an
increase in the number of calls.
By using dynamic clustering we can now report a recall of 91%.
As one can see, this method is slightly more sensitive in detecting
anomalous situations than the previous one. This is explained by
the relative similarity measure (Eq. 7), which fine-tunes the
clustering migration method by exploiting the relative variation
of signatures over time (the whole week). Finally, the overlap
rate of both methods corresponds to approximately 62% for the
whole sample used, and 66% for the blacklist provided by the
fraud analysts. Meanwhile, the remaining cases, other anomalous
situations with the same behavior as the previously detected
cases, are under inspection by the company analysts. Thus, the
next efforts will be directed to the development of a database of
fraud cases, as well as a rule induction engine to help analysts in
the evaluation of the alarms.
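The two recall figures follow directly from the counts given in the abstract, where 9 and 11 of the 12 customers listed by the analysts were found by the deviation-based and the clustering-based method respectively. The subscripts below are our own labels:

$$\text{recall}_{\text{deviation}} = \frac{9}{12} = 0.75, \qquad \text{recall}_{\text{clustering}} = \frac{11}{12} \approx 0.917$$

The second value corresponds to the 91% reported in the text. Since only the 12 listed customers are known fraud cases, these figures say nothing about the false positive rate.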
Concerning scalability, preliminary results showed that the most
costly step is the calculation of the summaries and signatures. It
requires several aggregation functions over CDR records in order
to group information by customer. At this time, this is done by
several SQL scripts on Microsoft SQL Server 2005. Once this
information is available, we can apply the methods discussed in
this work in any order to detect anomalies. When dealing with
such large volumes of data, we found that working with chunks of
information (summaries and signatures) plus clustered index
structures improves the processing time by at least one order of
magnitude without losing quality in the results. On the other
hand, this adds a new difficulty, in the sense that sliding the
window from ω to ω+1 requires rebuilding all the respective
indexes. Finally, when using dynamic clustering we divide the
original chunk of data D into a set of mutually exclusive
partitions $D'_i$, in order to make the processing of each
partition feasible. After all partitions have been processed, the
last step is to merge the clustering information resulting from
each chunk processed. The parameters that describe the cluster
topology obtained for each block are gathered in a single set
$D'_f$. These parameters are considered the data objects for
further processing to obtain the final K clusters. In future work,
we intend to report several scenarios of utilization and
optimization of both methods discussed in this work for detecting
anomalous situations.
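The partition-and-merge scheme can be illustrated with the sketch below. It is not the clustering algorithm used in this work: plain one-dimensional k-means on numbers stands in for signature clustering, the centroids are the only cluster parameters gathered, and the round-robin partitioning and all function names are assumptions made for the example.

```python
from typing import List


def kmeans_1d(values: List[float], k: int, iterations: int = 20) -> List[float]:
    """Plain 1-D k-means with deterministic, evenly spread initial centroids."""
    ordered = sorted(values)
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)]
                 for i in range(k)]
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return sorted(centroids)


def partition(data: List[float], n_parts: int) -> List[List[float]]:
    """Split D into mutually exclusive partitions D'_i (round-robin)."""
    return [data[i::n_parts] for i in range(n_parts)]


def cluster_in_partitions(data: List[float], n_parts: int, k: int) -> List[float]:
    """Cluster each partition, gather the centroids in D'_f, then cluster D'_f."""
    gathered: List[float] = []
    for part in partition(data, n_parts):
        gathered.extend(kmeans_1d(part, k))
    return kmeans_1d(gathered, k)


data = [1.0, 2.0, 3.0, 4.0, 101.0, 102.0, 103.0, 104.0]
final_centroids = cluster_in_partitions(data, n_parts=2, k=2)
```

In this small example the merged result coincides with clustering the whole data set at once. In general the merge step is an approximation, since centroids from partitions of different sizes are treated equally here.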
6. FINAL DISCUSSION
In this work we have presented two methods for detecting telecom
fraud situations. Both methods rely on the concept of signature to
summarize customer behavior over a certain period of time. In the
first approach, the user signature is used as a basis for
comparison. A difference between the actual behavior of the user
and their signature may reveal an abnormal situation. The second
approach uses dynamic clustering analysis to evaluate changes in
cluster membership over time. The main strength of these
detection methods is that they complement each other in reporting
anomalous situations. For instance, in section 5 we show an
overlap of 66% in the fraud situations raised by the proposed
methods. The experimental evaluation performed with data from a
week of voice calls, and the respective comparison with a list of
previously detected fraud cases, allowed us to conclude that the
proposed methods detect a high rate of true positives (91%).
Additionally, they discovered other fraud situations which had
not been reported previously by the analysts. Preliminary
discussions with fraud analysts gave us positive feedback about
the promising capabilities of the proposed methodologies.
7. REFERENCES
[1] Y. Kou, T. Lu, S. Sirwongwattana, and Y. Huang. Survey of
fraud detection techniques. In Proceedings of IEEE Intl
Conference on Networking, Sensing and Control, March
2004.
[2] T.F. Lunt. A survey of intrusion detection techniques.
Computer and Security, (53):405-418, 1999.
[3] Corinna Cortes and Daryl Pregibon. Signature-based methods
for data streams. Data Mining and Knowledge Discovery,
(5):167-182, 2001.
[4] Myers and Myers. Probability and Statistics for Engineers
and Scientists. Prentice Hall, 6th edition.
[5] Pedro Ferreira, Ronnie Alves, Orlando Belo and Luís
Cortesão. Establishing Fraud Detection Patterns Based on
Signatures. In Proceedings of Industrial Conference on Data
Mining´2006, July, 2006.
[6] Pedro Ferreira, Orlando Belo, Ronnie Alves, and Joel
Ribeiro. Fratelo - Fraud in Telecommunications: Technical
report. Tech Report 1, University of Minho, Department of
Informatics, May 2006.