Discovering Telecom Fraud Situations through Mining Anomalous Behavior Patterns

Ronnie Alves, Pedro Ferreira,

Orlando Belo, Joao Lopes, Joel

Ribeiro

University of Minho

Campus de Gualtar

4710-057 Braga

PORTUGAL

{ronnie, pedrogabriel,

obelo}@di.uminho.pt

Luís Cortesão

Portugal Telecom Inovação, SA,

Rua Eng. José Ferreira Pinto Basto

3810-106 Aveiro

PORTUGAL

lcorte@ptinovacao.pt

Filipe Martins

Telbit, Lda,

Rua Banda da Amizade, 38

3810-059 Aveiro

PORTUGAL

fmartins@telbit.pt

ABSTRACT

In this paper we tackle the problem of superimposed fraud

detection in telecommunication systems. We propose two

anomaly detection methods based on the concept of signatures.

The first method relies on a signature deviation-based approach,
while the second relies on dynamic clustering analysis. Experiments
carried out with real data, voice call records from an entire week
corresponding to approximately 2.5 million CDRs and 700 thousand
signatures processed per day, allowed us to detect several anomalous
situations. The fraud analysts provided us with a small list of 12
customers for whom fraudulent behavior was detected during this week.
Of these, 9 and 11 fraud situations were discovered by the first and
second method, respectively. Preliminary results and discussion with
fraud analysts have already shown that our methods are a valuable
tool to assist them in fraud detection.

1. INTRODUCTION

In superimposed fraud situations, fraudsters make illegitimate use of
a legitimate account by different means. In this case, some abnormal
usage is blended into the characteristic usage of the account. This
type of fraud is usually more difficult to detect and poses a bigger
challenge to telecommunications companies. Since the 1990s,
telecommunications companies have used several kinds of approaches
based on statistical analysis and heuristic methods to assist them in
the detection and categorization of fraud situations. Recently, they
have been adopting data mining and knowledge discovery techniques for
this task. In this paper we tackle the problem of superimposed fraud
detection in telecommunication systems. Two methods for discovering
fraud situations through mining anomalous customer behavior patterns
are presented. These methods are based on the concept of signature
[3], which has already been used successfully for anomaly detection
in many areas such as credit card usage [1], network intrusion [2]
and, in particular, telecommunications fraud [3]. Our goal was to
detect deviating behaviors in a timely manner, giving analysts a
better basis for accurate decisions when establishing potential fraud
situations.

2. THE ROLE OF SIGNATURES IN DETECTING FRAUD

The core concept of our technique is the notion of signature. We
emphasize the work of Cortes and Pregibon [3], since it was the main
inspiration for our use of signatures, although we have redefined
their notion of signature. A signature of a user corresponds to a
vector of feature variables whose values are determined during a
certain period of time. The variables can be simple, if they consist
of a single atomic value (e.g. an integer or a real number), or
complex, if they consist of two co-dependent statistical values,
typically the average and the standard deviation of a given feature.

Table 1. Description of the feature variables (fv) used in signatures and summaries.

Description                                Type
Duration of Calls                          Complex
N. of Calls – Working Days                 Complex
N. of Calls – Weekends and Holidays        Complex
N. of Calls – Working Time (8h-20h)        Complex
N. of Calls – Night Time (20h-8h)          Complex
N. of Calls to Diff. National Networks     Simple
N. of Calls as Caller (Origin)             Simple
N. of Calls as Called (Destination)        Simple
N. of International Calls                  Simple
N. of Calls as Caller in Roaming           Simple
N. of Calls as Called in Roaming           Simple

DMBA'06, August 20, 2006, Philadelphia, Pennsylvania, USA.

Copyright 2006 ACM 1-59593-439-1...$5.00

The choice of the type of each variable depends on several factors,
such as the complexity of the feature described or the data available
to perform the calculation. A feature like the duration of calls
shows significant variability, which is much better expressed through
an average (µ) and standard deviation (σ) pair. A feature like the
number of international calls typically occurs much less often, and
thus an average value is sufficient to describe it. Table 1 lists the
complete set of feature variables (fv) used in this work. A signature
S is then obtained from a function φ for a given temporal window ω,
where S = φ(ω). We define a time unit as the amount of time during
which CDRs are accumulated before being processed at the end of that
period. A summary C has the same information structure as a
signature, but it is used to summarize user behavior over a shorter
time period. Typically, a signature reflects the usage patterns over
a week, a month or even half a year, whereas a summary covers an
hour, half a day or a complete day. In this work, we considered a
period of one day for a summary and one week for the signature.
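As a rough illustration of this structure, the Python sketch below represents a complex feature variable as a (mean, standard deviation) pair and a simple one as a single number. The names `ComplexVar`, `Summary`, `call_duration` and `international_calls`, the two-feature layout, and the use of the population standard deviation are assumptions made for the example, not the schema of the actual system.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class ComplexVar:
    """Complex feature variable: two co-dependent values (mean, std)."""
    mean: float
    std: float


def complex_from_values(values):
    """Build a complex variable from raw observations (e.g. call durations)."""
    if not values:
        return ComplexVar(0.0, 0.0)  # assumption: no activity maps to (0, 0)
    return ComplexVar(mean(values), pstdev(values))


@dataclass(frozen=True)
class Summary:
    """Hypothetical two-feature summary; a signature has the same shape."""
    call_duration: ComplexVar      # complex feature
    international_calls: float     # simple feature


# Illustrative data: durations in seconds, one entry per call in the day.
durations = [60.0, 120.0, 180.0]
day = Summary(complex_from_values(durations), international_calls=2.0)
```

Because a summary and a signature share the same shape, the same class could hold either one; only the time window over which the values are computed differs.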

3. DEVIATING PATTERNS

3.1 Evaluating Similarities among Signatures

3.1.1 Similarity of Simple Feature Variables

A simple feature is defined by a single variable, which corresponds
to the average value of the considered feature. To compare simple
feature variables we use a ratio-scaled function. This type of
function makes a positive measurement on a non-linear scale, in this
case the exponential scale. The function takes values in the range
[0, 1] and is defined by the equation

$$d(S_x, S_y) = e^{-B \times \frac{|S_x - S_y|}{\mathrm{Amp}}} \tag{1}$$

In equation 1, $S_x$ and $S_y$ are the two variables under comparison,
B is a constant value, and Amp is the amplitude (the difference
between the maximum and minimum values) of the respective feature
variable over the whole signature space.
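A minimal Python sketch of the similarity function for simple variables follows, reading it as exp(-B × |S_x - S_y| / Amp). The default `b=1.0` and the handling of a zero amplitude are assumptions for illustration; the paper does not state the value of B.

```python
import math


def simple_similarity(s_x, s_y, amp, b=1.0):
    """Similarity of two simple feature variables: exp(-B * |S_x - S_y| / Amp)."""
    if amp <= 0:
        # Assumption: zero amplitude means the feature has the same value
        # in every signature, so the two values are treated as identical.
        return 1.0
    return math.exp(-b * abs(s_x - s_y) / amp)
```

Identical values give a similarity of 1, and the similarity decays towards 0 as the difference grows relative to the amplitude; a larger B makes the decay steeper.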

3.1.2 Similarity of Complex Feature Variables

Complex feature variables are defined by two co-dependent variables,
which correspond respectively to the average and the standard
deviation of the considered feature. For two complex variables,
$C_x = (M_x, \sigma_x)$ and $C_y = (M_y, \sigma_y)$, the similarity
function is defined in equation 2 and lies within the range [0, 1].

$$d(C_x, C_y) = d(M_x, M_y) \times \frac{|C_x \cap C_y|}{|C_x \cup C_y|} \tag{2}$$

Equation 2 combines two formulas: the similarity function for simple
variables (eq. 1) and the ratio $|C_x \cap C_y| / |C_x \cup C_y|$.
This ratio is also within the range [0, 1] and gives the degree of
overlap of the two complex feature variables by measuring the
intersection of the intervals $[M_x - \sigma_x, M_x + \sigma_x]$ and
$[M_y - \sigma_y, M_y + \sigma_y]$ relative to their union.
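The overlap ratio and the complex-variable similarity can be sketched in Python as below. The function names, the default `b=1.0`, and the treatment of degenerate cases (zero amplitude, zero-width intervals) are assumptions for illustration, since the text does not specify them.

```python
import math


def simple_similarity(s_x, s_y, amp, b=1.0):
    """Similarity of two simple values: exp(-B * |S_x - S_y| / Amp)."""
    if amp <= 0:
        return 1.0  # assumption: zero amplitude, values treated as identical
    return math.exp(-b * abs(s_x - s_y) / amp)


def interval_overlap(m_x, sd_x, m_y, sd_y):
    """Length of intersection over length of union of [M - sd, M + sd]."""
    lo_x, hi_x = m_x - sd_x, m_x + sd_x
    lo_y, hi_y = m_y - sd_y, m_y + sd_y
    inter = max(0.0, min(hi_x, hi_y) - max(lo_x, lo_y))
    if inter == 0.0:
        # No overlap of positive length; assumption: two identical
        # zero-width intervals count as full overlap.
        return 1.0 if (lo_x, hi_x) == (lo_y, hi_y) else 0.0
    # The intervals overlap, so their union is a single interval.
    union = max(hi_x, hi_y) - min(lo_x, lo_y)
    return inter / union


def complex_similarity(m_x, sd_x, m_y, sd_y, amp, b=1.0):
    """Similarity of the means multiplied by the interval overlap ratio."""
    return simple_similarity(m_x, m_y, amp, b) * interval_overlap(
        m_x, sd_x, m_y, sd_y)
```

For example, the intervals [8, 12] and [10, 14] intersect over a length of 2 and span a union of length 6, giving an overlap ratio of 1/3.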

3.2 Calculating the Distance among Signatures

Since the feature variables in the signature have different types,
each variable has to be evaluated by a distinct sub-function. Thus,
the dist function is composed of several sub-functions:
$dist = \theta(f_1, f_2, \ldots, f_n)$. Consider as an example a
simplified signature
$S = \{(\mu_a, \sigma_a); \mu_b; \mu_c; (\mu_d, \sigma_d)\}$, where
the first and last feature variables are complex (evaluated by Eq. 2)
and the second and third are simple (evaluated by Eq. 1). Let
$C = \{(\mu'_a, \sigma'_a); \mu'_b; \mu'_c; (\mu'_d, \sigma'_d)\}$ be
a summary. Since we consider deviation detection from a probabilistic
point of view, the distance between a signature S and a summary C
corresponds to the probability of C being different from S. The
proposed distance function can be written as:

$$D(S, C) = \alpha_1 \cdot f_1(S_1, C_1)^2 + \ldots + \alpha_n \cdot f_n(S_n, C_n)^2 \tag{3}$$

Different distance functions can be provided by the fraud analyst by
setting the weighting factors $\alpha_i$ to different values. Using
different distance functions allows deviations to be detected in
different scenarios. The overall distance function can then be
redefined as in equation 4.

$$Dist(S, C) = \max\{dist_1(S, C), dist_2(S, C), \ldots, dist_m(S, C)\} \tag{4}$$

If the distance exceeds a threshold value ε defined by the analyst,
i.e. Dist(S, C) > ε, an alarm is raised for further examination of
the respective user. Otherwise, the user is considered to be within
normal behavior.
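The weighted distance, the MAX combination and the threshold test can be sketched in Python as follows. This sketch assumes the per-feature deviations f_i(S_i, C_i) are already available as numbers (for instance, one minus the similarities of section 3.1, which is an assumption) and reads the weighted distance as a plain weighted sum of squares. The weights and deviation values are invented for illustration.

```python
def weighted_distance(deviations, weights):
    """Weighted sum of squared per-feature deviations: sum(alpha_i * f_i^2)."""
    if len(deviations) != len(weights):
        raise ValueError("one weight per feature variable is required")
    return sum(a * f ** 2 for a, f in zip(weights, deviations))


def overall_distance(deviations, weight_sets):
    """Maximum over the analyst-defined distance functions (one per weight set)."""
    return max(weighted_distance(deviations, w) for w in weight_sets)


def raises_alarm(deviations, weight_sets, epsilon):
    """True when the overall distance exceeds the analyst threshold epsilon."""
    return overall_distance(deviations, weight_sets) > epsilon


# Hypothetical per-feature deviations for four feature variables.
deviations = [0.9, 0.1, 0.8, 0.2]
weight_sets = [
    [1.0, 1.0, 1.0, 1.0],   # every feature counts equally
    [0.0, 0.0, 3.0, 0.0],   # scenario focused on the third feature only
]
```

Taking the maximum means that a single scenario-specific distance function is enough to raise an alarm, even if the other weightings stay below the threshold.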

3.3 Anomaly Detection Procedure

The anomaly detection procedure based on signature deviation consists
of several steps. It starts with a loading step, which imports the
signature and summary information of each user into the local
database of the system. The signatures are imported only once, when
the system is started. All the signatures of a user are kept over
time, since this information is also useful for later analysis. A
signature may have one of two statuses, "Active" or "Expired". For
each user only one signature, the most up-to-date one, can be Active.
The processing step is described by the algorithm presented in [5],
and follows the previous equations for calculating the distances and
similarities among signatures. According to equation (4), if an alarm
is raised, the user is put on a blacklist. This is performed in the
alarm triggering step, which is based on the calculation of all the
distance functions over the signatures. At the end, every raised
alarm must be verified by the analyst in order to determine whether
or not it corresponds to a fraud situation. The evaluation of the
alarms is supported by the system interface, which employs features
of dashboard systems and provides a complete set of valuable
information [5].

3.4 Signature Updating

The updating process of the signatures follows the ideas presented in
[3]. The update of a signature $S_t$ at instant t+1, $S_{t+1}$,
through a set of processed CDRs (summary) C is given by the formula:

$$S_{t+1} = \beta \cdot S_t + (1 - \beta) \cdot C \tag{5}$$

The constant β controls the balance between the previous signature
and the new actions C in the values of the new signature: the new
actions enter with weight 1 − β. This constant can be adjusted
depending on the size of the time window ω [3]. In contrast to the
system in [3], the signature is always updated. If
$Dist(S_t, C) \le \varepsilon$, the user is considered to have normal
behavior. If $Dist(S_t, C) > \varepsilon$, an alarm is triggered, but
the signature nevertheless continues to be updated. The reason is
that the alarm still needs to be analyzed by the company's fraud
analyst, who may consider it a false alarm. The continuous update of
that user's signature avoids losing the information gathered between
the moment the alarm was triggered and the moment the analyst gives a
verdict.
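The update rule is an exponentially weighted blend of the old signature and the new summary. The Python sketch below applies it component by component to signatures stored as flat lists of numbers; flattening complex variables into their mean and standard deviation components, and the example values, are assumptions made for illustration.

```python
def update_signature(signature, summary, beta):
    """Blend old and new values: beta * S_t + (1 - beta) * C, per component."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(signature) != len(summary):
        raise ValueError("signature and summary must have the same length")
    return [beta * s + (1.0 - beta) * c for s, c in zip(signature, summary)]


# Hypothetical values: the update runs whether or not an alarm was raised.
old_signature = [10.0, 2.0]
todays_summary = [20.0, 4.0]
new_signature = update_signature(old_signature, todays_summary, beta=0.5)
```

A value of β close to 1 makes the signature change slowly, while a smaller β lets recent summaries dominate.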

4. CHANGING PATTERNS

4.1 Clustering Signatures

The analysis of changes in the cluster topology over a period of time
provides valuable information for a better understanding of the usage
patterns of telecommunications services. In particular, the detection
of abrupt changes in cluster membership may provide strong evidence
of a fraud situation. We propose the application of dynamic
clustering analysis techniques over signature data. Our aim is that
these changes will also provide evidence to fraud analysts for
establishing potential fraud situations.

4.1.1 Similarity of Signatures

Signatures are composed of simple and complex variables. Traditional
similarity measures, such as Euclidean distance, Pearson correlation
or the Jaccard measure, are not applicable to signature comparison.
Therefore, we need to devise a new similarity measure that allows us
to determine similarities among signatures. We define the similarity
between two signatures as the combination of the variable similarity
measures defined in section 3.1. For two signatures X and Y, where
$X_i$ and $Y_i$ are respectively the feature variable i of X and Y,
and for n possible variables, the similarity measure can be defined
as in equation 6.

$$D(X, Y) = W_1 \cdot d(X_1, Y_1)^2 + \ldots + W_n \cdot d(X_n, Y_n)^2 \tag{6}$$

$D(X, Y) \in [0, 1]$, $W_i$ defines the weight of feature i, and
$\sum_{i=1}^{n} W_i = 1$. With this signature similarity measure, we
can compare all signatures. This provides an N x N matrix that
summarizes the similarities among the N signatures. The clustering
solution can then be obtained by taking this matrix as the input.
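The construction of the N x N similarity matrix can be sketched in Python as below. The sketch reads the signature similarity as a weighted sum of squared per-feature similarities with weights summing to 1. The per-feature function `exp_sim`, its amplitude of 10, and the sample signatures are hypothetical stand-ins for the measures of section 3.1.

```python
import math


def signature_similarity(x, y, feature_sims, weights):
    """Weighted sum of squared per-feature similarities: sum(W_i * d_i^2)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * d(xi, yi) ** 2
               for w, d, xi, yi in zip(weights, feature_sims, x, y))


def similarity_matrix(signatures, feature_sims, weights):
    """Pairwise similarities; only the upper triangle is computed and mirrored,
    which assumes every per-feature similarity is symmetric."""
    n = len(signatures)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = signature_similarity(signatures[i], signatures[j],
                                     feature_sims, weights)
            matrix[i][j] = matrix[j][i] = s
    return matrix


def exp_sim(a, b, amp=10.0):
    """Hypothetical per-feature similarity with B = 1 and Amp = 10."""
    return math.exp(-abs(a - b) / amp)


signatures = [[1.0, 4.0], [3.0, 4.0], [9.0, 0.0]]
matrix = similarity_matrix(signatures, [exp_sim, exp_sim], [0.5, 0.5])
```

For N signatures this needs on the order of N² comparisons, which is one reason the data would be processed in smaller chunks when N is large.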

4.2 Clustering Migration Analysis

Different usage patterns can be found depending on the moment of the
week [6]. These usage profiles are obtained by means of signature
clustering analysis, according to the method described previously in
section 4.1. Therefore, a cluster topology is produced for each day
of the week. This topology describes customers' usage patterns during
that period. Each cluster is described by the characteristics of its
centroid, which is itself defined as a signature. This allows direct
comparisons between signatures and cluster centroids, made according
to the similarity formula 6. A signature is assigned to a cluster by
comparing it against each cluster centroid and choosing the cluster
at the smallest distance.
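The assignment step can be sketched in Python as a nearest-centroid search. The `distance` argument is a hypothetical callable supplied by the caller; how a distance is derived from the similarity of formula 6 (for example as one minus the similarity) is an assumption left open here, and the absolute-difference distance used in the example is only a placeholder.

```python
def assign_to_cluster(signature, centroids, distance):
    """Return the index of the centroid at the smallest distance."""
    if not centroids:
        raise ValueError("at least one centroid is required")
    return min(range(len(centroids)),
               key=lambda k: distance(signature, centroids[k]))


def l1_distance(a, b):
    """Placeholder distance: sum of absolute component differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


# Centroids have the same shape as signatures, so they compare directly.
centroids = [[0.0, 0.0], [1.0, 2.5]]
cluster = assign_to_cluster([1.0, 2.0], centroids, l1_distance)
```

Ties are resolved in favor of the lowest cluster index, which is a property of `min` rather than something the paper specifies.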

4.2.1 Absolute and Relative Similarity

In order to compare signatures against cluster centroids, two types
of similarity measures can be defined: absolute and relative
similarity. Absolute similarity is the similarity value between the
signature and the centroid at a given time instant t. This value is
calculated according to formula 6. Relative similarity relates the
absolute similarities at instants t and t+1, giving the percentage of
signature variation between two consecutive time instants. This value
is obtained through the formula:

$$\Delta = \left\{ 1 - \frac{D(S_i, SignCl[S_i])^{t+1}}{D(S_i, SignCl[S_i])^{t}} \right\} \times 100\% \tag{7}$$

In formula 7, $S_i$ corresponds to a signature and $SignCl[S_i]$ to
the cluster that $S_i$ belongs to at instant t. Figure 1 shows a
positive variation, where the signature $S_i$ is closer to the
centroid 0 at instant t than at t+1.

Figure 1. Positive variation of the relative similarity of the

signature.

Figure 2. Negative variation of relative similarity of the

signature and change cluster membership.

A negative value of the relative similarity at instant t+1 indicates
that the signature $S_i$ is now closer to the centroid of the cluster
it belonged to at instant t. Nevertheless, a cluster membership
change can still be detected, since $S_i$ may now be even closer to
another cluster (cluster 1 in figure 2).
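The relative similarity can be sketched in Python as below. The sketch assumes formula 7 is read as (1 - D at t+1 / D at t) × 100, where D is the absolute similarity between the signature and the centroid of the cluster it belonged to at instant t; the input values are invented for illustration.

```python
def relative_similarity(d_t, d_t1):
    """Percentage variation of the absolute similarity between t and t+1."""
    if d_t == 0:
        raise ValueError("absolute similarity at instant t must be non-zero")
    return (1.0 - d_t1 / d_t) * 100.0


# Similarity to the centroid drops from 0.5 to 0.25: positive variation.
moving_away = relative_similarity(0.5, 0.25)

# Similarity to the centroid rises from 0.5 to 0.75: negative variation.
moving_closer = relative_similarity(0.5, 0.75)
```

Under this reading a positive value means the signature has drifted away from its centroid, and a negative value means it has moved closer.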

We define a cluster membership change as follows: a signature S
changes its cluster membership to cluster $C_j$ at instant t+1 if it
belongs to cluster $C_i$ at instant t, the distance $D(S, C_j)$ at
instant t+1 is minimal over all clusters, and
$D(S, C_j)^{t+1} < D(S, C_i)^{t}$. All the data on the cluster
membership of the signatures are kept for later analysis. These data,
which we call historical data, make it possible to assess the
evolution of customer behavior over time. To offer the analyst a tool
for closer examination of the changing behavior of customers during a
defined interval, analysis reports can be generated [6]. These
reports identify all the conditions used, as well as the average and
standard deviation of the signature variations and the maximum,
minimum and average values of all the signature feature variables.
The deviating signatures detected are added to a blacklist for
further analysis.

Figure 3. Example of fraud situations in the blacklist.

Figure 3 shows an example of a real fraud situation detected by our
methods in the evaluation study. The first line contains a header
with the temporal reference, the analysis report description, and the
limits of the range [µ-2σ, µ+2σ]; any variation outside these limits
is considered an abnormal situation. The following lines list the
moment when the anomaly was detected, the signature identification
(phone number), the cluster to which the signature belongs, a flag
indicating a cluster membership change (1 in the positive case), the
absolute similarity between the signature and the cluster, the
relative similarity (variation) and, in the last column, the
description of the respective analysis report. More detailed
information about the methods, as well as scalability issues
regarding their application, can be found in [5, 6].
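The [µ-2σ, µ+2σ] rule can be sketched in Python as follows, taking µ and σ as the mean and standard deviation of the signature variations in the analysed interval. The use of the population standard deviation, the function name and the sample variations are assumptions for illustration.

```python
from statistics import mean, pstdev


def abnormal_variations(variations, k=2.0):
    """Return (index, value) pairs lying outside [mu - k*sigma, mu + k*sigma]."""
    mu = mean(variations)
    sigma = pstdev(variations)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [(i, v) for i, v in enumerate(variations) if v < lo or v > hi]


# Hypothetical relative similarities (in percent) for ten signatures.
variations = [1.0, -2.0, 0.5, 1.5, -1.0, 0.0, 2.0, -0.5, 1.0, 60.0]
flagged = abnormal_variations(variations)
```

Here only the last signature, with a 60% variation, falls outside the range and would be added to the blacklist.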

5. EVALUATING REAL FRAUD SITUATIONS

In order to assess the quality of our methodology in detecting
anomalous behaviors, we examined the data corresponding to a week of
voice calls from a Portuguese mobile telecommunications network. The
complete set of CDRs corresponds to approximately 2.5 million records
and 700 thousand signatures processed per day. So far, no accurate
database of previous fraud cases exists. Thus, the settings of our
methods were guided by a small list of 12 customers (fraudsters in
the referenced week), provided by the fraud analysts in order to
detect other similar behaviors. In this first stage of detecting
anomalous situations, we are interested in the effectiveness of our
methods. Therefore, we worked on a subset of the data: a sample with
approximately 5 thousand summaries per day and their respective
signatures for the whole week. The detection process was carried out
by applying the method described in section 3. Several thresholds (ε)
were used, and four main distance functions were designed by
combining different feature variables and weights. Table 2
illustrates the alarms generated by the deviation-based approach.
Note the rightmost column (ε = 2.0): further investigation of those
alarms showed that some of them were real fraud situations.

Table 2. Alarms generated under different thresholds (ε) for three particular days of the week.

Day    ε = 0.8   ε = 1.0   ε = 1.2   ε = 1.6   ε = 2.0
Tue    2141      649       139       50        25
Wed    3029      1145      251       103       56
Sat    1006      560       150       39        23

To better understand the circumstances in which those alarms were
generated, one must investigate the impact of each variable on the
MAX distance function (Eq. 4). Figure 4 exemplifies such an
evaluation by allowing top-k queries over the complete set of alarms.
We also verified that the 10 most severe anomalous situations had
distance values ranging from 2.76 to 3.33. The feature variable with
the most impact on the distance calculation is the number of
international calls (originated ones). On the other hand, Figure 5
shows that the work-hours variable has great importance in the
distance calculation over the whole period.

Figure 4. The impact of each feature variable on the top-10 highest alarms.

Figure 5. An overall picture of feature variable distribution

over the max distance (ε ≥ 2).

It is important to mention that both methods provide only indications
of what could be recognized as anomalous situations. In fact, the
characteristics of the data provided by the analysts do not allow us
to apply any classification technique. Therefore, it is quite hard to
evaluate the false positive and false negative rates precisely.
Nevertheless, given the small list provided by the analysts, we can
report a recall of 75% (9 of the 12 listed customers).

To complement the previous results, we further used a dynamic
clustering approach to detect suspicious changes in cluster
membership over the whole week. The identification of those changes
triggers alarms for future inspection. After several executions, the
quality of the clustering was maximized with 8 clusters. The
distribution of the alarms raised by this method is shown in table 3.

Table 3. Alarms raised per cluster for three particular days of the week.

Cluster   Tue   Wed   Sat
1         3     9     1
2         9     7     123
3         3     12    71
4         5     17    16
5         23    21    22
6         20    31    40
7         8     11    26
8         52    72    0

The last row of table 3 (cluster 8) corresponds to the cluster with
the highest number of calls. Figure 6 shows an example of a change in
cluster membership, which represents a real fraud situation
identified by this method. The first and second customers move from
clusters with a lower average number of calls (1 and 2) to the
cluster with the highest number of calls (8), on days 4 and 3
respectively. The third customer in this example, although always in
the same cluster, registered a significant variation between days 5
and 6.

Figure 6. Example of anomalous situations related to an increase in the number of calls.

By using dynamic clustering we can now report a recall of 91% (11 of
the 12 listed customers). As one can see, this method is slightly
more sensitive in detecting anomalous situations than the previous
one. This is explained by the relative similarity measure (Eq. 7),
which fine-tunes the cluster migration method by exploring the
relative variation of signatures over time (the whole week). Finally,
the overlap rate of both methods is approximately 62% for the whole
sample used, and 66% for the blacklist provided by the fraud
analysts. Meanwhile, the remaining cases, other anomalous situations
with the same behavior as the cases already detected, are under
inspection by the company analysts. Thus, our next efforts will focus
on the development of a database of fraud cases, as well as a rule
induction engine to help analysts evaluate the alarms.

Concerning scalability, preliminary results showed that the most
costly step is the calculation of the summaries and signatures. It
requires several aggregation functions over CDR records in order to
group information by customer. At present, this is done by several
SQL scripts on Microsoft SQL Server 2005. Once this information is
available, we can apply the methods discussed in this work, in any
order, to detect anomalies. When dealing with such a large volume of
data, we found that working with chunks of information (summaries and
signatures) together with clustered index structures improves the
processing time by at least one order of magnitude without losing
quality in the results. On the other hand, it introduces a new
problem: sliding the window from ω to ω+1 requires rebuilding all the
respective indexes. Finally, when using dynamic clustering, we
divided the original chunk of data D into a set of mutually exclusive
partitions $D'_i$ in order to make the processing of each partition
feasible. After all partitions have been processed, the last step is
to merge the clustering information resulting from each processed
chunk. The parameters that describe the cluster topology obtained for
each block are gathered in a single set $D'_f$. These parameters are
considered the data objects for further processing to obtain the
final K clusters. In future work, we intend to report several
scenarios of use and optimization of both approaches discussed in
this work for detecting anomalous situations.

6. FINAL DISCUSSION

In this work we have presented two methods for detecting telecom
fraud situations. Both methods rely on the concept of signature to
summarize customer behavior over a certain period of time. In the
first approach, the user signature is used as a basis for comparison.
A difference between the current behavior of the user and the
signature may reveal an abnormal situation. The second approach uses
dynamic clustering analysis in order to evaluate changes in cluster
membership over time. A clear strength of these detection methods is
that they complement each other in reporting anomalous situations.
For instance, in section 5 we show an overlap of 66% in the fraud
situations raised by the proposed methods. The experimental
evaluation performed with data from a week of voice calls, and the
respective comparison with a list of previously detected fraud cases,
allowed us to conclude that the proposed methods detect a high rate
of true positives (91%). Additionally, they discovered other fraud
situations that had not been reported previously by the analysts.
Preliminary discussion with fraud analysts gave us feedback about the
promising capabilities of the proposed methodologies.

7. REFERENCES

[1] Y. Kou, T. Lu, S. Sirwongwattana, and Y. Huang. Survey of fraud detection techniques. In Proceedings of IEEE Intl Conference on Networking, Sensing and Control, March 2004.

[2] T.F. Lunt. A survey of intrusion detection techniques. Computer and Security, (53):405-418, 1999.

[3] Corinna Cortes and Daryl Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, (5):167-182, 2001.

[4] Myers and Myers. Probability and Statistics for Engineers and Scientists. Prentice Hall, 6th edition.

[5] Pedro Ferreira, Ronnie Alves, Orlando Belo, and Luís Cortesão. Establishing Fraud Detection Patterns Based on Signatures. In Proceedings of Industrial Conference on Data Mining 2006, July 2006.

[6] Pedro Ferreira, Orlando Belo, Ronnie Alves, and Joel Ribeiro. Fratelo - Fraud in Telecommunications: Technical report. Tech Report 1, University of Minho, Department of Informatics, May 2006.