Conference Paper

DisTec: Towards a Distributed System for Telecom Computing


Abstract

The continued exponential growth in both the volume and the complexity of information, set against the computing capacity of silicon-based devices bounded by Moore's Law, poses a new challenge to the specific requirements of analysts, researchers and intelligence providers. With respect to this challenge, a new class of techniques and computing platforms, such as the Map-Reduce model, which mainly focuses on scalability and parallelism, has been emerging. In this paper, to move the scientific prototype forward to practice, we elaborate a prototype of our applied distributed system, DisTec, for knowledge discovery from a social-network perspective in the field of telecommunications. The major infrastructure is constructed on Hadoop, an open-source counterpart of Google's Map-Reduce. We carefully devised our system to undertake mining tasks over terabytes of call records. To illustrate its functionality, DisTec is applied to a real-world large-scale telecom dataset. The experiments range from initial raw data preprocessing to final knowledge extraction. We demonstrate that our system performs well in such cloud-scale data computing.
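Since only the abstract is available, the sketch below is not the authors' implementation; it is a minimal Hadoop MapReduce job of the kind the abstract describes, aggregating raw comma-separated call detail records into per-subscriber call counts. The CDR field layout (caller number in the first field) and all class and path names are assumptions made for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts outgoing calls per subscriber from raw comma-separated CDR lines. */
public class CallCountJob {

  /** Emits (callerNumber, 1) for every well-formed CDR line; skips malformed ones. */
  public static class CdrMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text caller = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 2 || fields[0].isEmpty()) {
        return; // drop malformed records during preprocessing
      }
      caller.set(fields[0].trim()); // assumed layout: caller number in field 0
      ctx.write(caller, ONE);
    }
  }

  /** Sums the per-record counts for each subscriber. */
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable v : values) {
        total += v.get();
      }
      ctx.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cdr-call-count");
    job.setJarByClass(CallCountJob.class);
    job.setMapperClass(CdrMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

An illustrative invocation would point the job at any HDFS directory of text CDR files and an empty output directory, e.g. `hadoop jar sketch.jar CallCountJob /cdr/raw /cdr/call-counts` (paths hypothetical).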


... DisTec [3] is another SNA solution for marketing purposes. It is based on a parallel and distributed computing architecture. ...
Conference Paper
Full-text available
Call Detail Records (CDRs) are a valuable source of information: they open new opportunities for the telecom industry to maximize its revenues and help communities raise their standard of living in many different ways. However, CDRs must be analysed in order to extract this value, and they come with huge volume, a wide variety of data and a high data rate, while current telecom systems were designed without these issues in mind. CDRs can thus be seen as a Big Data source, and Big Data technologies (storage, processing and analysis) are applicable to CDR analytics. There are considerable research efforts addressing the challenges of CDR analysis. This paper presents the use of Big Data technology in CDR analysis through several CDR-analytics-based application examples, highlighting their architecture, the Big Data tools and techniques they use, and their CDR use-case scenarios.
Article
With the dramatic rise of mobile internet users and the administrative requirements of long-term data retention, telecom providers are facing increasingly challenging storage and retrieval issues for call detail records (CDRs). Existing storage systems can only meet the requirements of online query and offline analysis of CDRs; to the best of our knowledge, few studies have focused on optimizing CDR retrieval under long-term storage. In order to improve retrieval speed while ensuring a high compression ratio, in this paper we propose a novel hash storage scheme, termed dual-column bucketing (DCB), built on the Hive platform and exploiting its bucketing feature. Compared to the conventional scheme, the proposed DCB scheme improves performance for both CDR compression and query. In addition, similar storage scenarios such as SMS, email and extended detail records (XDRs) fall within the optimization scope of DCB. Experiments on real-world CDRs show that, in contrast to the conventional scheme, the proposed DCB scheme can save storage space by approximately 40%, reduce the amount of disk reads to 2%, and improve the retrieval speed of known-phone-number queries by up to seven times.
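DCB itself is not reproduced here. As a point of reference, the sketch below only exercises Hive's generic bucketing feature (CLUSTERED BY ... INTO n BUCKETS), declared through the standard Hive JDBC interface, which is the mechanism such a scheme builds on. The table schema, bucket columns, bucket count and connection settings are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/** Declares a bucketed (hash-partitioned) CDR table through Hive's JDBC interface. */
public class CreateBucketedCdrTable {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // hive-jdbc must be on the classpath
    // Assumed HiveServer2 endpoint; adjust host, port, database and credentials as needed.
    try (Connection conn =
            DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Illustrative schema: records are hashed into buckets by the two phone-number
      // columns, so a query on a known number only has to scan the matching bucket files.
      stmt.execute(
          "CREATE TABLE IF NOT EXISTS cdr ("
              + " calling_number STRING,"
              + " called_number  STRING,"
              + " start_time     TIMESTAMP,"
              + " duration_sec   INT)"
              + " CLUSTERED BY (calling_number, called_number) INTO 64 BUCKETS"
              + " STORED AS ORC");
    }
  }
}
```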
Conference Paper
Cloud adoption is a critical strategic decision for many organizations. The high growth potential of the cloud computing services market attracts telecom service providers into this new investment area for sustainable revenue growth, as it offers a means to supplement declining traditional-services revenues as well as to profit from their core businesses such as voice and messaging. Telecom service providers can play many potential roles in the cloud ecosystem, including those of a cloud customer, enabler, partner and/or broker, by using their inherent advantages that give them an edge over other traditional cloud computing service providers. There are two main reasons why telecom service providers should consider becoming engaged in cloud computing: to reap the benefits of cloud computing for IT optimization and to exploit new business opportunities. This work focuses on the potential impact of cloud computing on operational performance in a telecom contact center through IT optimization.
Article
Telecommunication data analysis is often used as a background application to motivate many problems. However, traditional analysis algorithms face new challenges due to the continued exponential growth in both the volume and the complexity of telecom data. In response to this challenge, a new class of techniques and computing frameworks, such as the MapReduce model, which mainly focus on scalability and parallelism, has been emerging. In this paper, we present our applied cloud-based system, TeleDatA, which combines data mining, social network analysis and statistical analysis with the MapReduce framework for knowledge discovery in telecommunications. As a full-functionality system, it provides data-flow-oriented preprocessing utilities, a chain engine, an expression evaluation engine and core analysis algorithms implemented with our new MapReduce-based computing model, which gives TeleDatA the ability to handle tera- and even peta-scale data in the telecom industry. More importantly, TeleDatA is applied to real-world telecom data through several application scenarios and shows good scalability, effectiveness and efficiency.
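The chain engine and expression evaluation engine are not described in this abstract, so the sketch below is only a rough illustration of how dependent MapReduce stages can be wired together using Hadoop's stock JobControl/ControlledJob utilities, one plausible substrate for such a chain. The two stages are unconfigured placeholders (no mappers or input/output paths set), not TeleDatA's actual components.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

/** Runs a two-stage pipeline where the analysis job depends on the preprocessing job. */
public class ChainDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Placeholder stages: a real pipeline would give each Job its own mapper, reducer
    // and input/output paths, with stage 1's output directory as stage 2's input.
    ControlledJob preprocess = new ControlledJob(Job.getInstance(conf, "preprocess-cdr"), null);
    ControlledJob analyze = new ControlledJob(Job.getInstance(conf, "analyze-cdr"), null);
    analyze.addDependingJob(preprocess); // analysis may only start after preprocessing succeeds

    JobControl control = new JobControl("cdr-pipeline");
    control.addJob(preprocess);
    control.addJob(analyze);

    // JobControl is a Runnable; run it in its own thread and poll until all jobs finish.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
    System.exit(control.getFailedJobList().isEmpty() ? 0 : 1);
  }
}
```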
Conference Paper
Data analysis is widely used in enterprises for its efficiency and accuracy, especially in the telecommunications industry, for tasks such as user behaviour analysis and customer churn prediction. However, with the exponential growth of data, traditional data analysis tools cannot handle such large-scale datasets. Furthermore, as business becomes more complicated, there is a growing need to integrate different data analysis tools. Traditional analysis tools also lack visualization, which makes their results hard to understand. We propose a distributed system named SAKU, based on cloud computing, that addresses these problems. In this paper, we implement several algorithms on the MapReduce framework in order to process large-scale data, and we discuss every part of the system. We focus on the data mining and graph mining algorithms implemented on the MapReduce framework, discuss how different data analysis tools are integrated in a workflow, and propose a new cloud-based report framework able to handle and display large-scale data. Most importantly, we apply the system to a scenario that meets real-world requirements, using a large volume of data obtained from telecom operators, and demonstrate its efficiency and scalability. In summary, SAKU offers the following new features and advantages, which constitute the contribution of this work: • Parallelized algorithms: we implement several algorithms on the MapReduce framework and validate that they achieve a nearly linear speedup, which makes them suitable for large-scale data.
Conference Paper
Full-text available
With ever growing competition in telecommunications markets, operators have to increasingly rely on business intelligence to offer the right incentives to their customers. Toward this end, existing approaches have almost solely focused on the individual behaviour of customers. Call graphs, that is, graphs induced by people calling each other, can allow telecom operators to better understand the interaction behaviour of their customers, and potentially provide major insights for designing effective incentives. In this paper, we use the Call Detail Records of a mobile operator from four geographically disparate regions to construct call graphs, and analyse their structural properties. Our findings provide business insights and help devise strategies for mobile telecom operators. Another goal of this paper is to identify the shape of such graphs. In order to do so, we extend the well-known reachability analysis approach with some of our own techniques to reveal the shape of such massive graphs. Based on our analysis, we introduce the Treasure-Hunt model to describe the shape of mobile call graphs. The proposed techniques are general enough for analysing any large graph. Finally, how well the proposed model captures the shape of other mobile call graphs needs to be the subject of future studies.
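To make the reachability analysis mentioned above concrete at toy scale, here is a small in-memory breadth-first search that reports which subscribers are reachable from a given caller along directed call edges. The adjacency-list representation and the hand-built graph are illustrative only; call graphs of the size studied in the paper are processed out of core or on a cluster.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Breadth-first reachability over a directed call graph kept as an adjacency list. */
public class CallGraphReachability {

  /** Returns the set of subscribers reachable from {@code source} via directed call edges. */
  static Set<String> reachableFrom(Map<String, List<String>> adj, String source) {
    Set<String> seen = new HashSet<>();
    Deque<String> queue = new ArrayDeque<>();
    seen.add(source);
    queue.add(source);
    while (!queue.isEmpty()) {
      String u = queue.poll();
      for (String v : adj.getOrDefault(u, List.of())) {
        if (seen.add(v)) { // add() returns false if v was already visited
          queue.add(v);
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    // Toy call graph: an edge A -> B means subscriber A called subscriber B.
    Map<String, List<String>> calls = new HashMap<>();
    calls.put("A", List.of("B", "C"));
    calls.put("B", List.of("D"));
    calls.put("E", List.of("A"));
    Set<String> reached = reachableFrom(calls, "A");
    System.out.println("Reachable from A: " + reached); // A, B, C, D; E is not reachable
  }
}
```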
Conference Paper
Full-text available
Social Network Analysis has emerged as a key paradigm in modern sociology, technology, and information sciences. The paradigm stems from the view that the attributes of an individual in a network are less important than their ties (relationships) with other individuals in the network. Exploring the nature and strength of these ties can help understand the structure and dynamics of social networks and explain real-world phenomena, ranging from organizational efficiency to the spread of information and disease. In this paper, we examine the communication patterns of millions of mobile phone users, allowing us to study the underlying social network in a large-scale communication network. Our primary goal is to address the role of social ties in the formation and growth of groups, or communities, in a mobile network. In particular, we study the evolution of churners in an operator's network spanning a period of four months. Our analysis explores the propensity of a subscriber to churn out of a service provider's network depending on the number of ties (friends) that have already churned. Based on our findings, we propose a spreading activation-based technique that predicts potential churners by examining the current set of churners and their underlying social network. The efficiency of the prediction is expressed as a lift curve, which indicates the fraction of all churners that can be caught when a certain fraction of subscribers were contacted.
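The spreading-activation idea can be illustrated with a small single-machine sketch: seed churners receive an initial activation, every node passes a fraction of its activation to its neighbours for a few rounds, and subscribers whose accumulated activation exceeds a threshold are flagged as churn risks. The spreading factor, iteration count and threshold below are arbitrary demonstration values, not the parameters or the exact update rule used in the paper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** A toy spreading-activation pass over an undirected social graph of subscribers. */
public class ChurnSpreading {

  /**
   * Spreads activation from known churners through the graph and returns the final
   * activation of every node. Nodes with high activation are churn-risk candidates.
   */
  static Map<String, Double> spread(
      Map<String, List<String>> graph, Set<String> churners, double factor, int iterations) {
    Map<String, Double> activation = new HashMap<>();
    churners.forEach(c -> activation.put(c, 1.0)); // seed churners with full activation

    for (int i = 0; i < iterations; i++) {
      Map<String, Double> next = new HashMap<>(activation); // nodes retain their activation
      for (Map.Entry<String, Double> e : activation.entrySet()) {
        List<String> neighbours = graph.getOrDefault(e.getKey(), List.of());
        if (neighbours.isEmpty()) continue;
        double share = factor * e.getValue() / neighbours.size();
        for (String n : neighbours) {
          next.merge(n, share, Double::sum); // each neighbour receives an equal share
        }
      }
      activation = next;
    }
    return activation;
  }

  public static void main(String[] args) {
    Map<String, List<String>> graph = Map.of(
        "alice", List.of("bob", "carol"),
        "bob", List.of("alice", "dave"),
        "carol", List.of("alice"),
        "dave", List.of("bob"));
    Map<String, Double> scores = spread(graph, Set.of("alice"), 0.5, 3);
    double threshold = 0.2; // arbitrary cut-off for flagging churn risk
    scores.forEach((user, a) -> {
      if (a >= threshold) System.out.printf("%s -> activation %.3f (churn risk)%n", user, a);
    });
  }
}
```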
Conference Paper
Full-text available
MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
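The core of LATE is ranking running tasks by their estimated time to completion rather than by raw progress. A minimal sketch of that estimate is shown below: the progress rate is taken as the progress score divided by elapsed time, and remaining time as remaining progress divided by that rate. The surrounding task bookkeeping and the sample numbers are invented for illustration.

```java
import java.util.Comparator;
import java.util.List;

/** Ranks running tasks by estimated time left, in the spirit of the LATE heuristic. */
public class LateEstimate {

  /** Minimal view of a running task: its progress score in [0,1] and elapsed run time. */
  record RunningTask(String id, double progressScore, double secondsRunning) {
    /** Progress rate = progress / elapsed time. */
    double progressRate() {
      return progressScore / secondsRunning;
    }

    /** Estimated time to completion = (1 - progress) / progress rate. */
    double estimatedTimeLeft() {
      return (1.0 - progressScore) / progressRate();
    }
  }

  public static void main(String[] args) {
    List<RunningTask> tasks = List.of(
        new RunningTask("map-001", 0.80, 100), // fast: about 25 s left
        new RunningTask("map-002", 0.40, 100), // slow: about 150 s left, speculate this one first
        new RunningTask("map-003", 0.90, 300));
    tasks.stream()
        .sorted(Comparator.comparingDouble(RunningTask::estimatedTimeLeft).reversed())
        .forEach(t -> System.out.printf("%s estimated %.0f s left%n", t.id(), t.estimatedTimeLeft()));
  }
}
```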
Article
Full-text available
Cloud computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require the large capital outlays in hardware to deploy their service or the human expense to operate it. They need not be concerned about overprovisioning for a service whose popularity does not meet their predictions, thus wasting costly resources, or underprovisioning for one that becomes wildly popular, thus missing potential customers and revenue. Moreover, companies with large batch-oriented tasks can get results as quickly as their programs can scale, since using 1,000 servers for one hour costs no more than using one server for 1,000 hours.
Article
Full-text available
Systems as diverse as genetic networks or the World Wide Web are best described as networks with complex topology. A common property of many large networks is that the vertex connectivities follow a scale-free power-law distribution. This feature is found to be a consequence of two generic mechanisms: networks expand continuously by the addition of new vertices, and new vertices attach preferentially to already well-connected sites. A model based on these two ingredients reproduces the observed stationary scale-free distributions, indicating that the development of large networks is governed by robust self-organizing phenomena that go beyond the particulars of the individual systems.
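The two mechanisms named in this abstract, growth and preferential attachment, translate directly into the well-known Barabási–Albert construction; a compact sketch follows. The network size, the number of edges added per new vertex and the random seed are arbitrary demonstration values.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Grows a scale-free graph by preferential attachment (Barabási–Albert style). */
public class PreferentialAttachment {

  /** Returns an edge list of a graph with n vertices, each new vertex attaching to m others. */
  static List<int[]> generate(int n, int m, Random rnd) {
    List<int[]> edges = new ArrayList<>();
    // "degreeBag" holds one entry per edge endpoint, so sampling from it uniformly
    // picks an existing vertex with probability proportional to its degree.
    List<Integer> degreeBag = new ArrayList<>();

    // Start from a small fully connected core of m+1 vertices.
    for (int u = 0; u <= m; u++) {
      for (int v = u + 1; v <= m; v++) {
        edges.add(new int[] {u, v});
        degreeBag.add(u);
        degreeBag.add(v);
      }
    }

    // Growth: each new vertex attaches to m distinct, degree-biased existing vertices.
    for (int newNode = m + 1; newNode < n; newNode++) {
      Set<Integer> targets = new HashSet<>();
      while (targets.size() < m) {
        targets.add(degreeBag.get(rnd.nextInt(degreeBag.size())));
      }
      for (int t : targets) {
        edges.add(new int[] {newNode, t});
        degreeBag.add(newNode);
        degreeBag.add(t);
      }
    }
    return edges;
  }

  public static void main(String[] args) {
    List<int[]> edges = generate(1000, 3, new Random(42));
    System.out.println("Generated " + edges.size() + " edges"); // 6 core edges + 996 * 3 = 2994
  }
}
```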
Article
Full-text available
Electronic databases, from phone to e-mails logs, currently provide detailed records of human communication patterns, offering novel avenues to map and explore the structure of social and communication networks. Here we examine the communication patterns of millions of mobile phone users, allowing us to simultaneously study the local and the global structure of a society-wide communication network. We observe a coupling between interaction strengths and the network's local structure, with the counterintuitive consequence that social networks are robust to the removal of the strong ties but fall apart after a phase transition if the weak ties are removed. We show that this coupling significantly slows the diffusion process, resulting in dynamic trapping of information in communities and find that, when it comes to information diffusion, weak and strong ties are both simultaneously ineffective.
Keywords: complex systems; complex networks; diffusion and spreading; phase transition; social systems
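The tie-removal experiment can be reproduced at toy scale as follows: sort edges by weight, remove them weakest-first, and track the size of the largest connected component. Because union-find cannot delete edges, the sketch below replays the removals backwards, inserting edges strongest-first and recording the giant-component size at each step. The weighted edge list is illustrative only.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Tracks the giant component while ties are removed weakest-first, using reverse union-find. */
public class TieRemoval {

  /** Weighted undirected edge; the weight could be total call minutes between two subscribers. */
  record Edge(int u, int v, double weight) {}

  static int find(int[] parent, int x) {
    while (parent[x] != x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  }

  public static void main(String[] args) {
    int n = 6;
    Edge[] edges = {
      new Edge(0, 1, 5.0), new Edge(1, 2, 1.0), new Edge(2, 3, 4.0),
      new Edge(3, 4, 0.5), new Edge(4, 5, 3.0), new Edge(5, 0, 2.0)
    };
    // Removal order: weakest ties first.
    Arrays.sort(edges, Comparator.comparingDouble(Edge::weight));

    // Replay removals backwards: start from an empty graph and re-insert edges
    // strongest-first, recording the giant-component size after each insertion.
    int[] parent = new int[n];
    int[] size = new int[n];
    for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    int giant = 1;
    int[] giantAfterRemovals = new int[edges.length + 1];
    giantAfterRemovals[edges.length] = 1; // all ties removed: only isolated nodes remain

    for (int i = edges.length - 1; i >= 0; i--) {
      int ru = find(parent, edges[i].u()), rv = find(parent, edges[i].v());
      if (ru != rv) {
        parent[ru] = rv;
        size[rv] += size[ru];
        giant = Math.max(giant, size[rv]);
      }
      giantAfterRemovals[i] = giant; // giant component once the i weakest ties are removed
    }
    System.out.println(Arrays.toString(giantAfterRemovals)); // [6, 6, 4, 2, 2, 2, 1] for this toy graph
  }
}
```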
Article
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
Article
Data intensive computing facilitates human understanding of complex problems that must process massive amounts of data. Through the development of new classes of software, algorithms and hardware, data intensive applications provide timely and meaningful analytical results in response to exponentially growing data complexity and associated analysis requirements. This paper considers some of the application drivers for the evolution of data intensive computing from storage centric to analysis centric requirements.
Conference Paper
Cloud computing emerges as a new computing paradigm which aims to provide reliable, customized and QoS-guaranteed dynamic computing environments for end-users. This paper reviews recent advances in Cloud computing, identifies the concepts and characteristics of scientific Clouds, and finally presents an example of a scientific Cloud for data centers.
Conference Paper
MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
Conference Paper
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the distributed co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
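DisCo itself is a distributed framework built on Hadoop; to make the underlying co-clustering step concrete, here is a tiny single-machine sketch of squared-error "checkerboard" co-clustering that alternates between reassigning rows and columns to their best-fitting clusters. It is a simplified stand-in rather than the DisCo algorithm, and the toy matrix, cluster counts and initialization are arbitrary.

```java
import java.util.Arrays;

/** A tiny single-machine sketch of squared-error "checkerboard" co-clustering. */
public class SimpleCoClustering {

  /** Mean of every (row-cluster, column-cluster) block of the matrix. */
  static double[][] blockMeans(double[][] x, int[] rowC, int[] colC, int k, int l) {
    double[][] sum = new double[k][l];
    int[][] count = new int[k][l];
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j < x[i].length; j++) {
        sum[rowC[i]][colC[j]] += x[i][j];
        count[rowC[i]][colC[j]]++;
      }
    }
    for (int r = 0; r < k; r++) {
      for (int c = 0; c < l; c++) {
        if (count[r][c] > 0) sum[r][c] /= count[r][c];
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    // Toy matrix with two row groups and two column groups.
    double[][] x = {
      {9, 9, 1, 1}, {9, 9, 1, 1}, {9, 9, 1, 1},
      {1, 1, 9, 9}, {1, 1, 9, 9}, {1, 1, 9, 9},
    };
    int k = 2, l = 2;
    int[] rowC = new int[x.length], colC = new int[x[0].length];
    for (int i = 0; i < rowC.length; i++) rowC[i] = i % k;               // rows start interleaved
    for (int j = 0; j < colC.length; j++) colC[j] = j * l / colC.length; // columns start contiguous

    for (int iter = 0; iter < 10; iter++) {
      // Row pass: move each row to the row cluster whose block means fit it best.
      double[][] mu = blockMeans(x, rowC, colC, k, l);
      for (int i = 0; i < x.length; i++) {
        int best = 0;
        double bestErr = Double.MAX_VALUE;
        for (int r = 0; r < k; r++) {
          double err = 0;
          for (int j = 0; j < x[i].length; j++) {
            double d = x[i][j] - mu[r][colC[j]];
            err += d * d;
          }
          if (err < bestErr) { bestErr = err; best = r; }
        }
        rowC[i] = best;
      }
      // Column pass: symmetric reassignment for columns.
      mu = blockMeans(x, rowC, colC, k, l);
      for (int j = 0; j < x[0].length; j++) {
        int best = 0;
        double bestErr = Double.MAX_VALUE;
        for (int c = 0; c < l; c++) {
          double err = 0;
          for (int i = 0; i < x.length; i++) {
            double d = x[i][j] - mu[rowC[i]][c];
            err += d * d;
          }
          if (err < bestErr) { bestErr = err; best = c; }
        }
        colC[j] = best;
      }
    }
    System.out.println("row clusters:    " + Arrays.toString(rowC)); // [0, 0, 0, 1, 1, 1]
    System.out.println("column clusters: " + Arrays.toString(colC)); // [0, 0, 1, 1]
  }
}
```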
Article
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
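Bigtable itself is not publicly available, but HBase, its open-source counterpart in the Hadoop ecosystem, exposes essentially the same data model of row key, column family, qualifier and timestamped cell. The sketch below writes and reads one cell against a hypothetical pre-created "cdr" table with a "d" column family; the table layout and row-key scheme are assumptions, not drawn from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Writes and reads one cell in a Bigtable-style (row key, family:qualifier) model via HBase. */
public class CdrCell {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("cdr"))) { // assumed pre-created table with family "d"

      // Row key: subscriber number plus a timestamp is a common pattern for per-subscriber scans.
      byte[] rowKey = Bytes.toBytes("8613700000000#20240101T120000");
      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("called"), Bytes.toBytes("8613800000000"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("duration"), Bytes.toBytes(125L));
      table.put(put);

      Result result = table.get(new Get(rowKey));
      long duration = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("duration")));
      System.out.println("stored call duration: " + duration + " s");
    }
  }
}
```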
Article
This paper discusses the concept of Cloud Computing to achieve a complete definition of what a Cloud is, using the main characteristics typically associated with this paradigm in the literature. More than 20 definitions have been studied allowing for the extraction of a consensus definition as well as a minimum definition containing the essential characteristics. This paper pays much attention to the Grid paradigm, as it is often confused with Cloud technologies. We also describe the relationships and distinctions between the Grid and Cloud approaches.
Article
Networks of coupled dynamical systems have been used to model biological oscillators, Josephson junction arrays, excitable media, neural networks, spatial games, genetic control networks and many other self-organizing systems. Ordinarily, the connection topology is assumed to be either completely regular or completely random. But many biological, technological and social networks lie somewhere between these two extremes. Here we explore simple models of networks that can be tuned through this middle ground: regular networks 'rewired' to introduce increasing amounts of disorder. We find that these systems can be highly clustered, like regular lattices, yet have small characteristic path lengths, like random graphs. We call them 'small-world' networks, by analogy with the small-world phenomenon (popularly known as six degrees of separation). The neural network of the worm Caenorhabditis elegans, the power grid of the western United States, and the collaboration graph of film actors are shown to be small-world networks. Models of dynamical systems with small-world coupling display enhanced signal-propagation speed, computational power, and synchronizability. In particular, infectious diseases spread more easily in small-world networks than in regular lattices.
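The construction described here, a ring lattice whose edges are rewired at random with probability p, is straightforward to reproduce; a compact sketch follows. The vertex count, neighbourhood size and rewiring probability are arbitrary demonstration values.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Builds a Watts-Strogatz small-world graph: ring lattice plus random rewiring. */
public class SmallWorld {

  /** Edge set of a small-world graph on n vertices (k nearest neighbours, rewiring probability p). */
  static Set<List<Integer>> generate(int n, int k, double p, Random rnd) {
    Set<List<Integer>> edges = new HashSet<>();
    // Ring lattice: connect each vertex to its k/2 neighbours on each side.
    for (int u = 0; u < n; u++) {
      for (int d = 1; d <= k / 2; d++) {
        int v = (u + d) % n;
        edges.add(List.of(Math.min(u, v), Math.max(u, v)));
      }
    }
    // Rewiring: with probability p, replace an edge's far endpoint by a random vertex,
    // skipping self-loops and duplicates (which would silently vanish in the set).
    for (List<Integer> e : new ArrayList<>(edges)) {
      if (rnd.nextDouble() < p) {
        int u = e.get(0);
        int w = rnd.nextInt(n);
        List<Integer> candidate = List.of(Math.min(u, w), Math.max(u, w));
        if (w != u && !edges.contains(candidate)) {
          edges.remove(e);
          edges.add(candidate);
        }
      }
    }
    return edges;
  }

  public static void main(String[] args) {
    Set<List<Integer>> g = generate(1000, 4, 0.1, new Random(7));
    System.out.println("edges: " + g.size()); // stays at 2000: rewiring moves edges, it never adds any
  }
}
```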
Conference Paper
Data intensive computing is concerned with creating scalable solutions for capturing, analyzing, managing and understanding multi-terabyte and petabyte data volumes. Such data volumes exist in a diverse range of application domains, including scientific research, bio-informatics, cyber security, social computing and commerce. Innovative hardware and software technologies to address these problems must scale to meet these ballooning data volumes and simultaneously reduce the time needed to provide effective data analysis. This paper describes some of the software architecture challenges that must be addressed when building data intensive applications and supporting infrastructures. These revolve around requirements for adaptive resource utilization and management, flexible integration, robustness and scalable data management.