The Analysis of Subscriber in Postpaid Mobile Communication Industry (Case Study: Indosat Matrix)


ABSTRACT

NUR ANDI SETIABUDI. The Analysis of Subscriber in Postpaid Mobile Telecommunication Industry (Case Study: Indosat Matrix ). Advised by ASEP SAEFUDDIN and WISHNU SUBEKTI.

Accurate segmentation, profiling, and churn analysis are appropriate ways to address customer issues in the face of business competition in the mobile telecommunication industry. This research established a comprehensive study of segmentation, profiling, and churn analysis for Indosat Matrix’s subscribers using statistics and data mining tools, taking business capabilities into consideration.

Segmentation was built with the K-means clustering algorithm based on subscribers’ values. The number of segments, five, was decided by business experts. The results indicated that K-means was able to define the segments. The profile of each segment was then visualized on a two-dimensional plot called a biplot. Each segment had usage characteristics that differed from those of the others. Churn analysis was performed by binary logistic regression in two stages. The first stage estimated a model of churn based on invoice and tenure. Several statistical tests suggested that this model was able to predict churn events. By fitting current invoice and tenure to the estimated model of churn, subscribers could be classified into an ‘at risk’ or ‘at safe’ group according to their estimated probability of churn. Once this classification was obtained, the second stage modeled the ‘at risk’ status with derived variables of invoice, called features usage, as explanatory variables. This model was also able to discriminate excellently between ‘at risk’ and ‘at safe’ subscribers.


THE ANALYSIS OF SUBSCRIBER

IN POSTPAID MOBILE TELECOMMUNICATION INDUSTRY

( Case Study: Indosat Matrix )

NUR ANDI SETIABUDI

G14052422

Research Report

to complete the requirement for graduation of Bachelor Degree in Statistics at Department of Statistics

Faculty of Mathematics and Natural Sciences Bogor Agricultural University

DEPARTMENT OF STATISTICS

FACULTY OF MATHEMATICS AND NATURAL SCIENCES

BOGOR AGRICULTURAL UNIVERSITY


Title : The Analysis of Subscriber in Postpaid Mobile Telecommunication Industry (Case Study: Indosat Matrix )

Author : Nur Andi Setiabudi

NIM : G14052422

Approved by :

Advisor I

Dr. Asep Saefuddin, M.Sc. NIP. 195703161981031004

Advisor II

Wishnu Subekti, ST, MM. NIK. 75013679

Acknowledged by :

Dean of Faculty of Mathematics and Natural Sciences Bogor Agricultural University

Dr. drh. Hasim, DEA NIP. 196103281986011002


BIOGRAPHY

Nur Andi Setiabudi was born in Cilacap on 1 September 1987 as the son of Duryat and Irun. He has a brother and two sisters.

He finished his education at SD Negeri Dayeuhluhur 03 in 1999 and graduated from SLTP Negeri 1 Dayeuhluhur in 2002. After graduating from SMA Negeri 1 Dayeuhluhur in 2005, he continued his studies at Bogor Agricultural University through USMI. A year later, he took Statistics as his major in the Department of Statistics and chose Consumer Sciences in the Department of Family and Consumer Sciences as his supporting courses.


ACKNOWLEDGEMENTS

Alhamdulillah, all gratitude to Allah SWT, the Most Merciful, Who gave me the chance, spirit, health, and capability to finish my research.

This paper presents my research in customer relationship management. It was performed to fulfil a requirement for graduation with a Bachelor Degree in Statistics at the Department of Statistics, Faculty of Mathematics and Natural Sciences, Bogor Agricultural University.

I acknowledge that the completion of my research would not have been possible without help from many people, from the time the research was planned until it was finished. I deeply appreciate their ideas, criticism, and improvements during the process. I would like to express my sincere gratitude to my advisors, Mr. Asep Saefuddin for his expert guidance and suggestions for this research, and Mr. Wishnu Subekti for enlightening discussions. Thanks are due to Mr. Fahar Yuhandi for his valuable help in providing data. I also wish to thank all my friends in ‘Statistika 42’ and ‘Pondok Assalam’ for their togetherness in seeking knowledge and for their true friendship. I give my special thanks to ‘my special’, Widya Ningsih, for sharing the nice days. I am especially grateful to my beloved family, Imih, Bapa, Ibu, Riska, Kang Ofik, Teh Enci, Pa Ridwan, Tegar and Kia for their never-ending love and support.

Finally, I hope my little work will be useful for all.

Bogor, September 2009


CONTENT

LIST OF FIGURE

LIST OF TABLE

LIST OF APPENDIX

INTRODUCTION

Background

Objective

LITERATURE REVIEW

Cluster Analysis

Biplot Analysis

Binary Logistic Regression

METHODOLOGY

Source of Data

Method

RESULT AND DISCUSSION

Segmentation and Profiling

Profile of Sample

Segments and Profiles

Churn Analysis

Model of Churn

Model of ‘At Risk’

CONCLUSION

RECOMMENDATION


LIST OF FIGURE

Figure 1 Plot of tenure vs. invoice by segment

Figure 2 Biplot for segment and feature usage

LIST OF TABLE

Table 1 Descriptive statistic of sample

Table 2 Invoice usage of sample by feature

Table 3 Evaluation of segmentation result

Table 4 Segment summary information

Table 5 Description of sample for churn analysis

Table 6 Summary of logistic regression analysis of churn

Table 7 Description of sample for ‘at risk’ analysis

Table 8 Summary of logistic regression analysis of ‘at risk’

LIST OF APPENDIX

Appendix 1 Description of variables for analysis

Appendix 2.A Histogram of tenure on standardized data

Appendix 2.B Histogram of invoice on standardized data

Appendix 3.A Segment summary

Appendix 3.B Mean and standard deviation of invoice and tenure by segment on standardized data

Appendix 4 Percentage of invoice by feature and segment

Appendix 5 Definition of categorical and dummy variables for churn and ‘at risk’ analysis

Appendix 6.A Bar chart of subscriber churn and stay by tenure

Appendix 6.B Bar chart of subscriber churn and stay by invoice

Appendix 7.A ROC curve for model of churn

Appendix 7.B Plot of sensitivity and specificity versus all possible cut off points in the model of churn

Appendix 8.A Probability of churn by all possible categories of tenure

Appendix 8.B Probability of churn by all possible categories of invoice

Appendix 9.A Bar chart of subscriber ‘at risk’ and ‘at safe’ by tenure and invoice

Appendix 9.B Bar chart of subscriber ‘at risk’ and ‘at safe’ by features usage

Appendix 10.A ROC curve for model of ‘at risk’

Appendix 10.B Plot of sensitivity and specificity versus all possible cut off points in the model of ‘at risk’

Appendix 11.A Probability to categorized ‘at risk’ by all possible categories of voice domestic usage


INTRODUCTION

Background

The mobile telecommunication industry has developed dynamically over the years. It continues to create new opportunities and to become more profitable for business. Many companies have therefore been drawn into this sector, which has led to tight competition. Providers compete fiercely with each other to acquire new subscribers and retain existing ones in order to raise profitability.

A provider has to offer the best services, but the biggest challenge comes from customer issues. Understanding subscribers is becoming increasingly important in this business environment, and a provider must know its subscribers well. To make this easier, it is necessary to classify subscribers into several segments and to profile them according to the desired criteria. Moreover, a provider also has to recognize high-risk subscribers in order to keep them from churning.

Segmentation is the process of dividing subscribers into homogeneous groups or classes, called segments, based on similar characteristics such as value and usage behavior. With segmentation, a provider can channel resources and discover opportunities more effectively (Jansen 2007). Accurate, verifiable segmentation gives decision makers the information they need to evaluate and execute strategies for improving subscriber profitability and campaign efficiency.

Profiling describes subscribers, and the subscribers within each associated segment, by their attributes. Knowing the profile of each customer, a provider can treat the customer according to their needs in order to increase lifetime value (Boundsaythip & Runsala 2001).

The term churn refers to attrition, that is, a decline in the number of subscribers. Three kinds of churn are distinguished in the literature: involuntary churn or forced attrition, voluntary churn, and unavoidable or expected churn (Berry & Linoff 2004; Yang & Chiu 2006).

Predicting customer churn is critical because it helps a provider identify the signals of churn. The likelihood or probability of churn can be analyzed using the call records generated and stored in the data warehouse system. Once indications of churn are detected, the provider can determine which incentives should be offered to subscribers in the risk group in order to improve retention and extend loyalty.

As one of the mobile telecommunication service providers in Indonesia, Indosat is also interested in segmentation, profiling, and churn analysis for its subscribers. So far, for its postpaid service, Matrix, segmentation and profiling have been performed subjectively based on invoice, dividing subscribers into two segments: regular and VIP. Churn has not yet been modeled (Subekti 2009, personal communication).

This research established a comprehensive study of segmentation, profiling, and churn analysis for a million Matrix subscribers using statistics and data mining tools, taking business capabilities into consideration. Segments were defined by the K-means clustering algorithm based on subscribers’ values, namely invoice and tenure. A biplot was used to visualize the profile of each segment according to features usage. Churn analysis was performed by binary logistic regression using the invoice for features usage as the explanatory variables.

Several related works have already been performed. Jansen (2007) defined segments for Vodafone’s subscribers based on usage behavior. Several clustering techniques were adopted, and the results of each technique were evaluated and compared with each other. Jansen then profiled each segment according to demographic data and analyzed the relation between segments and profiles. Lin (2007) performed segmentation based on call detail records for a mobile operator’s subscribers, using K-means. Mozer et al (2000) predicted churn using data from call detail records; models were constructed by logistic regression, decision tree, and neural network. In addition, a case study of churn analysis applying logistic regression was presented by Mutanen (2006).

Objective

The main objectives of this research were:

1. To define segments of subscribers based on their values, represented by tenure and invoice.

2. To describe the profile of each segment according to feature usage.

3. To obtain factors affecting the churn based


BRIEF THEORETICAL REVIEW

Cluster Analysis

The objective of cluster analysis is to organize n objects into K clusters (K < n) according to the similarities among them, such that objects within a cluster are more similar to each other than to objects in other clusters. From a business perspective, clustering has a similar meaning to segmentation.

Cluster analysis rests on a measure of similarity between objects, based on their variables. This similarity is commonly expressed as a distance. There are many ways to measure similarity; Euclidean distance is recommended for clustering. Euclidean distance is appropriate for uncorrelated variables. If the variables are correlated, the data should be transformed using principal component analysis.

The Euclidean distance between two p-dimensional objects, $\mathbf{x} = [x_1, x_2, \ldots, x_p]'$ and $\mathbf{y} = [y_1, y_2, \ldots, y_p]'$, is

$$d_E(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})} \tag{1}$$

If the conditions for using Euclidean distance cannot be satisfied, the Mahalanobis distance has to be used instead. The Mahalanobis distance is a Euclidean distance weighted by the inverse of the covariance matrix. The Mahalanobis distance between two objects, $\mathbf{x}$ and $\mathbf{y}$, is defined as

$$d_M(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})' \mathbf{S}^{-1} (\mathbf{x} - \mathbf{y})} \tag{2}$$

where the matrix $\mathbf{S}$ contains the sample variances and covariances. However, without prior knowledge of the distinct groups, $\mathbf{S}$ cannot be computed. For this reason, Euclidean distance is often preferred for clustering (Johnson & Wichern 1998).
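The two distances in equations (1) and (2) can be compared with a minimal sketch. The function names and the sample points below are hypothetical, and the Mahalanobis version is written for two variables only so that the matrix inverse can be spelled out by hand.

```python
import math

def euclidean(x, y):
    """Equation (1): square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mahalanobis_2d(x, y, s):
    """Equation (2) for two variables; s is a 2x2 covariance matrix."""
    (a, b), (c, d) = s
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - y[0], x[1] - y[1]]
    # Quadratic form (x - y)' S^{-1} (x - y).
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

# Hypothetical points, for illustration only.
x, y = (1.0, 2.0), (4.0, 6.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
print(euclidean(x, y))                  # 5.0
print(mahalanobis_2d(x, y, identity))   # 5.0, since S is the identity matrix
```

With the identity matrix as S the two distances coincide, which illustrates why the Mahalanobis distance can be read as a covariance-weighted Euclidean distance.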

Basically, there are two methods of cluster analysis: the hierarchical method and the non-hierarchical, or partitional, method. A hierarchical method permits a cluster to have sub-clusters and is often displayed as a dendrogram. It is appropriate if the data set is not very large and the number of clusters is not known. A partitional method simply divides the objects of the data set into non-overlapping clusters such that each object is in exactly one cluster. It can be run on very large data sets.

The number of clusters to be formed has to be determined, but so far there is no generally accepted procedure for doing so. A usual approach is to plot the scores of the first two principal components. The decision should be guided by theory and by the practicality of the result.

K-means is a well-known algorithm for partitional clustering. It was first published by MacQueen (1967). His K-means concept represents a generalization of the ordinary sample mean. The process appears to give partitions that are reasonably efficient in the sense of within-cluster variance. K-means allocates each object to one of the K clusters so as to minimize the within-cluster sum of squares:

$$SSE = \sum_{i=1}^{K} \sum_{\mathbf{x} \in C_i} d(\mathbf{x}, \mathbf{c}_i)^2 \tag{3}$$

and

$$\mathbf{c}_i = \frac{1}{m_i} \sum_{\mathbf{x} \in C_i} \mathbf{x} \tag{4}$$

where $\mathbf{x}$ = an object, $C_i$ = the $i$th cluster, $\mathbf{c}_i$ = the cluster center (centroid) of $C_i$, $m_i$ = the number of objects in the $i$th cluster, $K$ = the number of clusters, and $d$ = Euclidean distance.

The basic K-means algorithm is as follows:

1. Select K data points as the initial centroids.

2. Assign each object to the nearest centroid.

3. Recompute the centroid of each cluster.

4. Repeat steps 2 and 3 until the centroids do not change.
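The four steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the SAS procedure used in this research; the function names, the choice to pass the initial centroids explicitly, and the sample data are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means following steps 1 to 4.

    `centroids` holds the K initial centres (step 1), supplied by the
    caller so that the run is deterministic.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centre of an empty cluster
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def sse(points, labels, centroids):
    """Within-cluster sum of squares, equation (3)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical standardized (tenure, invoice) pairs, for illustration only.
data = [(-1.0, -1.0), (-0.8, -1.2), (1.0, 1.1), (1.2, 0.9)]
labels, centres = kmeans(data, centroids=[data[0], data[2]])
print(labels)   # [0, 0, 1, 1]
```

Taking the first K centres from the data is only one possible initialization; the result of K-means generally depends on this choice.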

One approach to evaluating a clustering is the root mean square standard deviation (RMS). It provides a measure of the average distance between the objects within a cluster. The RMS of a cluster $C_i$ is

$$RMS_i = \sqrt{\frac{\sum_{\mathbf{x} \in C_i} d(\mathbf{x}, \mathbf{c}_i)^2}{v\,(m_i - 1)}} \tag{5}$$

where $v$ = the number of variables. Well-separated clusters composed of homogeneous objects will have a small RMS.
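As a small numerical illustration of equation (5), the sketch below computes the RMS of one cluster. The function name and the three sample objects are hypothetical.

```python
import math

def rms_std(members, centroid):
    """Equation (5): sqrt( sum of squared distances / (v * (m - 1)) )."""
    v = len(centroid)      # number of variables
    m = len(members)       # number of objects in the cluster
    total = sum(math.dist(x, centroid) ** 2 for x in members)
    return math.sqrt(total / (v * (m - 1)))

# Hypothetical cluster of three objects measured on two variables.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
centre = tuple(sum(col) / len(cluster) for col in zip(*cluster))
print(round(rms_std(cluster, centre), 4))   # 1.4142
```

Here the centroid is (1, 1), the squared distances sum to 8, and dividing by v(m − 1) = 4 gives an RMS of the square root of 2.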

Biplot Analysis

A biplot is a graphical representation of the information in an n × p data matrix. The bi- refers to the two kinds of information contained in a data matrix: the information in the rows pertains to objects, and that in the columns pertains to variables (Johnson & Wichern 1998).


Biplot analysis is based on the singular value decomposition (SVD). Consider an $n \times p$ matrix $\mathbf{X}$ of rank $r$, where $r \le p \le n$. The matrix may be decomposed as

$$\mathbf{X} = \mathbf{U} \mathbf{L} \mathbf{A}' \tag{6}$$

where $\mathbf{U}_{n \times r}$ and $\mathbf{A}_{p \times r}$ are matrices of singular vectors and $\mathbf{L}_{r \times r}$ is a diagonal matrix of the singular values of $\mathbf{X}$. The columns of $\mathbf{U}$ are orthonormal eigenvectors of $\mathbf{X}\mathbf{X}'$, and the columns of $\mathbf{A}$ are orthonormal eigenvectors of $\mathbf{X}'\mathbf{X}$. The singular values are the positive square roots of the nonzero eigenvalues of $\mathbf{X}'\mathbf{X}$.

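Equation (6) can be written as

$$\mathbf{X} = \mathbf{U} \mathbf{L}^{\alpha} \mathbf{L}^{1-\alpha} \mathbf{A}' \tag{7}$$

where $0 \le \alpha \le 1$. If $\mathbf{G} = \mathbf{U}\mathbf{L}^{\alpha}$ and $\mathbf{H}' = \mathbf{L}^{1-\alpha}\mathbf{A}'$, then the $(i,j)$th element of the matrix $\mathbf{X}$ can be expressed as

$$x_{ij} = \mathbf{g}_i' \mathbf{h}_j \tag{8}$$

where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$, the $\mathbf{g}_i'$ are the rows of $\mathbf{G}$, and the $\mathbf{h}_j'$ are the rows of $\mathbf{H}$ (Sartono et al 2003).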

Although many values of α are possible, three are commonly used: 1, ½, and 0. When α = 1, the result is called a row metric preserving biplot. This display preserves the distances between pairs of rows and is useful for studying objects. When α = 0, the result is a column metric preserving biplot. This display preserves the distances between columns and is useful for interpreting variances and relationships between variables. The remaining value, α = ½, gives equal scaling or weight to the rows and columns. It is useful for interpreting the interaction between two factors (Lipkovich & Smith 2002).

The ability of the biplot to represent the variation in the original data can be computed as follows:

$$\rho^2 = \frac{\lambda_1 + \lambda_2}{\sum_{k=1}^{r} \lambda_k} \tag{9}$$

where:

λ1 = the largest eigenvalue

λ2 = the second largest eigenvalue

λk = the kth eigenvalue
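The decomposition and the goodness-of-fit measure can be sketched with NumPy as below. This is an illustrative sketch only, not the SAS macro used later in this report; the function name, the returned layout (rows of `g` and `h` as the row and column markers) and the sample matrix are assumptions.

```python
import numpy as np

def biplot_coordinates(x, alpha=0.5):
    """Return row markers G, column markers H and the 2-D goodness of fit.

    Follows X = U L A' with G = U L^alpha and H' = L^(1 - alpha) A'.
    """
    u, s, at = np.linalg.svd(x, full_matrices=False)   # s holds singular values
    g = u * s ** alpha                 # scale column k of U by s_k^alpha
    h = at.T * s ** (1.0 - alpha)      # scale column k of A by s_k^(1 - alpha)
    eigenvalues = s ** 2               # eigenvalues of X'X
    rho2 = eigenvalues[:2].sum() / eigenvalues.sum()   # equation (9)
    return g[:, :2], h[:, :2], rho2

# Hypothetical column-centred data matrix (4 objects, 3 variables).
x = np.array([[2.0, 1.0, 0.0],
              [1.0, 0.0, -1.0],
              [-1.0, 0.0, 1.0],
              [-2.0, -1.0, 0.0]])
g, h, rho2 = biplot_coordinates(x)
print(g.shape, h.shape, round(float(rho2), 3))
```

The sample matrix has rank 2, so its first two dimensions reproduce it exactly and the goodness of fit equals 1; real data would normally give a value below 1.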

Binary Logistic Regression

Binary logistic regression is a form of regression used for a binary response variable, such as ‘event’ (y=1) and ‘nonevent’ (y=0), or ‘churn’ and ‘stay’. Suppose there is a single explanatory variable x. The logistic regression model is linear in the logit of the probability of the event at value x, π(x), as follows:

$$\text{logit}[\pi(x)] = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \alpha + \beta x \tag{10}$$

The odds of an event are then

$$\frac{\pi(x)}{1 - \pi(x)} = \exp(\alpha + \beta x) = e^{\alpha}\left(e^{\beta}\right)^{x} \tag{11}$$

This exponential relationship provides an interpretation of β: the odds are multiplied by $e^{\beta}$ for every 1-unit increase in x (Agresti 2007).

Suppose $x_1$ and $x_2$ are two values of x. The odds ratio of $x_1$ to $x_2$, θ, is

$$\theta = \frac{\pi(x_1) / [1 - \pi(x_1)]}{\pi(x_2) / [1 - \pi(x_2)]} = \frac{\exp(\alpha + \beta x_1)}{\exp(\alpha + \beta x_2)} = \exp[\beta(x_1 - x_2)] \tag{12}$$

or

$$\ln(\theta) = \beta(x_1 - x_2) \tag{13}$$

From equation (12), the odds ratio is θ = exp(β) when $x_1 = 1$ and $x_2 = 0$.

The odds ratio is a measure of effect size that describes the strength of association, or non-independence, between two binary variables. It plays an important role in logistic regression, where it is often used to draw conclusions from the model.

The significance of the explanatory variables in the logistic model can be assessed by the G test statistic and the Wald test. The G test is a likelihood ratio test used to assess the overall significance of the parameters in the model. If there are p explanatory variables in the logistic regression model, the G test statistic can be expressed as

$$G = -2 \ln\left[\frac{L_0}{L_p}\right] \tag{14}$$

where $L_0$ = the likelihood without explanatory variables and $L_p$ = the likelihood with p explanatory variables. The hypotheses of the test are

H0 : β1 = β2 = … = βp = 0, versus

H1 : at least one βi ≠ 0, where i = 1, 2, …, p

Under the null hypothesis, the G test statistic follows a chi-square distribution with p degrees of freedom.
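A minimal sketch of the G test is given below, working from log-likelihoods rather than likelihoods. The function names and the two log-likelihood values are hypothetical; the p-value helper uses a closed form that holds only for an even number of degrees of freedom, which keeps the example free of external libraries.

```python
import math

def g_statistic(loglik_null, loglik_full):
    """Equation (14): G = -2 ln(L0 / Lp), computed from log-likelihoods."""
    return -2.0 * (loglik_null - loglik_full)

def chi2_sf_even_df(x, df):
    """P(chi-square with `df` degrees of freedom > x), for even df only.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Hypothetical log-likelihoods, for illustration only.
g = g_statistic(loglik_null=-120.0, loglik_full=-110.0)
print(g)                                  # 20.0
print(chi2_sf_even_df(g, df=8) < 0.05)    # True
```

Applied to the likelihood ratio statistic reported in Table 6 (3873.261 with 8 degrees of freedom), the same helper returns a value far below 0.0001, in line with the p-value shown there.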

If the null hypothesis of the G test is rejected, the Wald test can be used to assess the significance of each βi individually. The Wald test statistic is

$$W_i = \frac{\hat{\beta}_i}{SE(\hat{\beta}_i)} \tag{15}$$

with hypotheses

H0 : βi = 0, versus

H1 : βi ≠ 0, where i = 1, 2, …, p

Under the null hypothesis, the W statistic follows a standard normal distribution (Hosmer & Lemeshow 2000).
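The Wald test and the odds ratio can be illustrated with a few of the estimates reported in Table 6. The sketch assumes the usual form of the statistic, the estimate divided by its standard error, whose square is the Wald chi-square; the function names are hypothetical.

```python
import math

# (estimate, standard error) pairs transcribed from Table 6.
estimates = {
    "Tenure1": (3.710, 0.109),
    "Tenure5": (0.724, 0.119),
    "Invoice2": (1.299, 0.062),
}

def wald_z(beta, se):
    """Wald statistic: the estimate divided by its standard error."""
    return beta / se

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

for name, (beta, se) in estimates.items():
    z = wald_z(beta, se)
    # Columns: name, W, W squared (Wald chi-square), odds ratio, significant at 5%?
    print(name, round(z, 2), round(z * z, 1), round(math.exp(beta), 3),
          two_sided_p(z) < 0.05)
```

The squared statistics and the odds ratios computed this way differ slightly from the Wald χ² and odds ratio columns of Table 6 because the inputs are rounded to three decimals.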

The accuracy of a logistic regression model is evaluated by a classification table, which relies on a single cut-off point to classify the predicted probabilities. It gives information about the correct classification rate (CCR), sensitivity, and specificity. Sensitivity measures the proportion of correctly classified events, whereas specificity measures the proportion of correctly classified nonevents (Peng, Lee & Ingersoll 2002). Hosmer & Lemeshow (2000) underlined that a classification table is most appropriate when classification is a stated goal of the analysis.
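The three quantities of the classification table can be computed as in the sketch below. The function name, the rule that a probability strictly above the cut-off counts as an event, and the sample data are assumptions made for illustration.

```python
def classification_summary(y_true, prob, cutoff):
    """Classify as event when prob > cutoff, then summarise the 2x2 table."""
    tp = sum(1 for y, p in zip(y_true, prob) if y == 1 and p > cutoff)
    fn = sum(1 for y, p in zip(y_true, prob) if y == 1 and p <= cutoff)
    tn = sum(1 for y, p in zip(y_true, prob) if y == 0 and p <= cutoff)
    fp = sum(1 for y, p in zip(y_true, prob) if y == 0 and p > cutoff)
    return {
        "sensitivity": tp / (tp + fn),      # correctly classified events
        "specificity": tn / (tn + fp),      # correctly classified nonevents
        "ccr": (tp + tn) / len(y_true),     # correct classification rate
    }

# Hypothetical outcomes and fitted probabilities, for illustration only.
y = [1, 1, 0, 0, 0, 0, 1, 0]
p = [0.30, 0.04, 0.02, 0.07, 0.01, 0.03, 0.12, 0.02]
summary = classification_summary(y, p, cutoff=0.05)
print(summary)
```

In this toy example two of the three events and four of the five nonevents are classified correctly, giving a CCR of 0.75.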

A more complete description of classification accuracy is given by the area under the ROC (receiver operating characteristic) curve, commonly called the C-statistic. Suppose there is a total of t pairs with different responses, of which $n_c$ are concordant, $n_d$ are discordant, and $t - n_c - n_d$ are tied. The C-statistic is then expressed as (SAS Institute 2003):

$$C = \frac{n_c + 0.5\,(t - n_c - n_d)}{t} \tag{16}$$

The ROC curve plots sensitivity against 1 − specificity over the entire range of possible cut-off points. The area under the ROC curve, which ranges from zero to one, measures the model's ability to discriminate between subjects who experience the event of interest and those who do not. As a general rule, C = 0.5 suggests no discrimination; 0.7 ≤ C < 0.8 is considered acceptable discrimination; 0.8 ≤ C < 0.9 is considered excellent discrimination; and C ≥ 0.9 is considered outstanding discrimination (Hosmer & Lemeshow 2000).
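Equation (16) can be evaluated directly by enumerating all pairs of one event and one nonevent, as in the sketch below. The function name and the sample data are hypothetical, and the brute-force pairing is only practical for small samples.

```python
def c_statistic(y_true, prob):
    """Equation (16): share of concordant pairs, counting ties as one half."""
    events = [p for y, p in zip(y_true, prob) if y == 1]
    nonevents = [p for y, p in zip(y_true, prob) if y == 0]
    t = len(events) * len(nonevents)          # pairs with different responses
    nc = sum(1 for e in events for n in nonevents if e > n)   # concordant
    nd = sum(1 for e in events for n in nonevents if e < n)   # discordant
    return (nc + 0.5 * (t - nc - nd)) / t

# Hypothetical outcomes and fitted probabilities, for illustration only.
y = [1, 1, 0, 0, 0]
p = [0.9, 0.4, 0.4, 0.2, 0.1]
c = c_statistic(y, p)
print(round(c, 3))
```

Of the six pairs in this toy example, five are concordant and one is tied, so C = 5.5 / 6.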

METHODOLOGY

Source of Data

For segmentation and profiling, this research used a sample of approximately 3% of the total MSISDNs that had been active for three consecutive months, drawn at random from the data warehouse. Segmentation was performed based on tenure and invoice. Feature usage was used to profile subscribers and their associated segments. Appendix 1 lists these variables.

The churn analysis used somewhat more data than described above. It involved a sample of about 6% of the MSISDNs that were active three months earlier. Invoice and tenure were used as the explanatory variables of churn. To estimate the model of ‘at risk’, about 2.8% of the currently active MSISDNs were used. In this analysis, the explanatory variables were features usage. All variables are described in Appendix 1. Only MSISDNs with an invoice greater than zero were used.

Method

This research was divided into two sections. The first performed segmentation and profiling, and the second analyzed subscriber churn and subscribers ‘at risk’. The methodology is summarized as follows:

Section I : Segmentation and profiling

1. Preparing the data, including exploring the characteristics of the data, transforming the data into z values to adjust the scale, and stripping out the outliers.

2. Calculating the Pearson correlation coefficient between variables.

3. Running the K-means clustering algorithm to form segments based on tenure and invoice, and reviewing the result.

4. Constructing the biplot.

5. Finding the profile of each segment.

6. Interpretation.

Section II : Churn Analysis

1. Preparing the data, including classifying the values of the explanatory variables into categories.

2. Representing the categories of each explanatory variable as dummy variables. Appendix 5 displays information about the dummy variables.

3. Finding the main characteristics of the data.

4. Running the binary logistic regression to build the model of churn.

5. Classifying subscribers into the ‘at risk’ or ‘at safe’ group by fitting the invoice and tenure of the current month to the model of churn from step 4 at the optimum cut-off point: a subscriber is ‘at risk’ (risk=1) if the probability of churn exceeds the cut-off point, and ‘at safe’ (risk=0) otherwise.

6. Building the logistic model of subscribers ‘at risk’.

7. Checking and evaluating each logistic regression model.

8. Interpreting result consider to business


RESULT AND DISCUSSION

Segmentation and Profiling

Profile of Sample

From the qualifying sample, about 32 MSISDNs whose invoice exceeded the 99.9th percentile were stripped out because they were identified as outliers.

Segmentation was performed based on customer value using two variables: tenure, the duration of subscription, and invoice, the average invoice over three months. Derived variables of invoice, called features usage, were then used to profile the segments.

Table 1 Descriptive statistic of sample

Statistic         Tenure (month)    Invoice (Rp)
Minimum                     1.03               0
Maximum                   177.00       3,998,803
Mean                       52.56         177,341
Median                     44.20          91,920
Std. deviation             41.57         279,377

Descriptive statistics of the tenure and invoice of the sample are summarized in Table 1, and their distributions can be seen in Appendix 2.A and 2.B. According to Table 1, the sample had very high variance, especially in invoice. The minimum tenure was 1 month, the maximum was 177 months, and the mean was about 53 months. The mean invoice was Rp 177,341, but about 50% of subscribers had an invoice of less than Rp 92,000.

Table 2 Invoice usage of sample by feature

Feature                      Percentage of invoice
Voice
  Domestic                                   59.88
  International                               4.06
  VoIP                                        2.05
  3G                                          0.09
Value Added
  SMS                                        16.97
  MMS                                         0.10
  GPRS                                        1.44
  3G Data                                     0.55
International Roaming
  Voice                                       8.98
  SMS                                         3.69
  GPRS                                        2.19
Invoice                                     100.00

The largest parts of the invoice were spent on domestic voice, SMS, and international roaming voice. Subscribers spent over 85% of their invoice on those three features. In contrast, they spent less than 1% of their invoice on 3G features. Table 2 displays the invoice usage of the sample by feature.

Segments and Profiles

Because tenure and invoice were on different scales, the data were first standardized into z values with zero mean and unit standard deviation, following the formula $z = (x - \bar{x})/s$, where $\bar{x}$ and $s$ are the mean and standard deviation of the sample. The Pearson correlation between tenure and invoice was about 0.27, which indicates no strong correlation between the variables, so Euclidean distance was used when executing the K-means clustering algorithm. The number of segments to be identified was determined by business experts. In this case, Indosat decided to have five segments (Subekti 2009, personal communication).
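The standardization and the correlation check described above can be sketched as follows. The function names and the five sample subscribers are hypothetical; the actual analysis was run on the full sample.

```python
import math

def standardize(values):
    """z = (x - mean) / s, with s the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def pearson(xs, ys):
    """Pearson correlation, computed from the standardized values."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical tenure (months) and invoice (Rp) values, for illustration only.
tenure = [3.0, 12.0, 40.0, 75.0, 130.0]
invoice = [60000.0, 150000.0, 90000.0, 400000.0, 120000.0]
z_tenure = standardize(tenure)
print(round(pearson(tenure, invoice), 3))
```

After standardization both variables have mean zero and unit standard deviation, so neither dominates the Euclidean distance merely because of its unit of measurement.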

Figure 1 Plot of tenure vs. invoice by segment

Table 3 Evaluation of segmentation result

Segment    RMS    Dist. to nearest segment    Dist. ratio
A         0.30                        1.10           3.65
B         0.39                        1.10           2.80
C         0.45                        1.76           3.89
D         0.91                        2.27           2.48
E         1.72                        4.88           2.83

Figure 1 shows that K-means allocated each subscriber to a non-overlapping segment, so that each subscriber was in exactly one segment. The root mean square standard deviation (RMS) of each segment is shown in the second column of Table 3. The distance to the nearest segment provides a measure of the separation between centroids. The distance ratio was calculated by dividing the distance to the nearest segment by the RMS. According to Table 3, segments D and E had very large within-segment variation, with segment E the largest. This is also clearly visible in Figure 1. However, the distance ratios were large enough for the result to be considered satisfactory. Appendix 3.A and 3.B provide more information about the K-means clustering result.
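The distance ratio can be recomputed from the rounded values printed in Table 3, as in the short sketch below. The variable names are hypothetical.

```python
# Values transcribed from Table 3: segment -> (RMS, distance to nearest segment).
table3 = {"A": (0.30, 1.10), "B": (0.39, 1.10), "C": (0.45, 1.76),
          "D": (0.91, 2.27), "E": (1.72, 4.88)}

# Distance ratio = distance to the nearest segment divided by the RMS.
ratios = {seg: dist / rms for seg, (rms, dist) in table3.items()}
for seg, r in ratios.items():
    print(seg, round(r, 2))
```

The recomputed ratios agree with the Dist. Ratio column to within about 0.03; the small differences arise because the RMS and distance values in the table are rounded to two decimals.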

As Figure 1 shows, all five segments grouped homogeneous subscribers, and the segments clearly differed from one another. The first three segments (A, B, and C) had the lowest invoices and were separated by tenure. The last two segments (D and E) had higher invoices than A, B, and C. The tenure of segments D and E was widely scattered, covering a very large range.

Table 4 Segment summary information

Segment    Subscribers    Mean tenure (month)    Mean invoice (Rupiah)
A               12,418                  15.33                   73,865
B               12,562                  59.97                  137,674
C                4,064                 132.63                  200,978
D                1,965                  69.43                  767,615
E                  271                  92.00                2,123,224
ALL             31,280                  52.56                  177,341

Profiles were created by exploring the size and the mean tenure and invoice of each segment, as displayed in Table 4. In addition, profiles were created based on the percentage of invoice spent on each feature. Information about feature usage is summarized in Appendix 4 and visualized by the biplot in Figure 2. The biplot was produced with the SAS macro written by Friendly (1998), using α = ½, and was able to represent about 99.6% of the information.
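As a check on the 99.6% figure, and assuming that the percentages printed on the axes of Figure 2 (98.1% and 1.5%) are the shares of the total eigenvalue sum attributed to the first and second dimensions, equation (9) gives:

$$\rho^2 = \frac{\lambda_1 + \lambda_2}{\sum_{k=1}^{r} \lambda_k} = 0.981 + 0.015 = 0.996$$

so the two plotted dimensions account for about 99.6% of the variation in feature usage across segments.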

Finally, according to Table 4 and Figure 2, the characteristics of the segments are described as follows:

1. Segment A :

Segment A contained 39.7% of subscribers. New subscribers with the lowest invoices belonged to this segment. The largest parts of the invoice were spent on domestic voice and SMS. Interestingly, segment A used secondary features, such as voice 3G, GPRS, MMS, and data 3G, more than the other segments did.

2. Segment B :

Segment B contained 40.2% of subscribers, making it the largest segment. On average, its subscribers had been subscribed for about five years. However, they spent only a small amount on invoice. The largest parts of the invoice of this segment were for domestic voice and SMS.

Figure 2 Biplot for segment and feature usage (Dimension 1: 98.1%; Dimension 2: 1.5%)

3. Segment C :

Segment C contained only 13.0% of subscribers. Although its subscribers contributed low invoices, they were considered highly loyal, as indicated by tenures of over 10 years. Subscribers within segment C were associated with the domestic voice feature.

4. Segment D :

Segment D contained only 6.3% of subscribers. The mean invoice of its subscribers was greater than that of subscribers overall. Additionally, the variance of tenure was very large, which indicates that this segment included both very low and very high tenures. Compared to the other segments, subscribers of segment D were VoIP users. They also spent more of their invoice than the first three segments on features for international connections, such as international voice and roaming.

5. Segment E :

Segment E contained only 0.9% of subscribers and was the smallest segment formed. Its subscribers were considered the most profitable, as indicated by their very high invoices. As in segment D, the variance of tenure was very large. Most of the invoice was allocated to international features, e.g. international roaming and international voice.

According to the biplot, segment A behaved similarly to segment B in spending invoice. Both were dominated by users of basic features, who also spent small amounts of their invoice on several secondary features. Meanwhile, segment C behaved similarly to segment D in spending invoice. Segment E was very different from the other segments: its subscribers needed international connectivity services.

The biplot in Figure 2 also provides important information about features usage. According to the biplot, SMS usage and domestic voice usage were uncorrelated. SMS, GPRS, data 3G, MMS, and voice 3G usage were positively correlated with one another but negatively correlated with SMS international roaming and VoIP usage. Voice international roaming was positively correlated with voice international and GPRS international roaming usage, but these were negatively correlated with domestic voice usage.

Churn Analysis

Model of Churn

A logistic regression model was used to analyze subscriber churn. The explanatory variables were the categories of tenure and invoice three months earlier. These two explanatory variables were fitted to the active status in the current month: ‘churn’ (churn=1) or ‘stay’ (churn=0). The analysis involved 60,000 randomly selected MSISDNs. A description of the sample is provided in Table 5, and Appendix 6.A and 6.B show the distribution of the sample by categories of tenure and invoice.

Table 5 Description of sample for churn analysis

Churn    Frequency (total)    Frequency (%)
0        57653                96.1
1        2347                 3.9

A summary of the logistic regression analysis of churn is presented in Table 6. The overall model was evaluated using the likelihood ratio or G test. This test showed that the logistic model of churn was significant; hence it was effective for estimating the churn event based on tenure and invoice three months earlier.

Model performance was measured by the area under the ROC curve, or C-statistic. In this model, the C-statistic was 0.841. This means that for 84% of all possible pairs of subscribers, one who churned and one who stayed, the model correctly assigned a higher probability to the one who churned. This was considered excellent discrimination between churn and stay. For further information, Appendix 7.A provides the ROC curve for the model of churn.
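The pairwise reading of the C-statistic can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the function name `c_statistic` and the toy data are hypothetical, and counting tied pairs as one half is a common convention assumed here.

```python
from itertools import product


def c_statistic(probabilities, outcomes):
    """Share of (churn, stay) pairs in which the churner has the higher
    estimated probability; tied pairs count as one half (assumption)."""
    churn = [p for p, y in zip(probabilities, outcomes) if y == 1]
    stay = [p for p, y in zip(probabilities, outcomes) if y == 0]
    score = 0.0
    for p_churn, p_stay in product(churn, stay):
        if p_churn > p_stay:
            score += 1.0
        elif p_churn == p_stay:
            score += 0.5
    return score / (len(churn) * len(stay))


# Hypothetical toy data, not taken from the study.
toy_probabilities = [0.30, 0.10, 0.05, 0.20, 0.02, 0.08]
toy_outcomes = [1, 0, 0, 1, 0, 1]
toy_c = c_statistic(toy_probabilities, toy_outcomes)
print(round(toy_c, 3))
```

In the toy data, 8 of the 9 churn/stay pairs are ordered correctly, so the statistic is 8/9. The value 0.841 reported for the model of churn has the same interpretation over all pairs in the sample.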

To assess the accuracy of the model, an optimal cut-off point should be chosen that maximizes sensitivity and specificity. In this case, the optimal cut-off point was 0.05 (Appendix 7.B), which yielded a correct classification rate of 78%, sensitivity of 76%, and specificity of 78%. This result also suggested that the model gave satisfactory results in predicting churn events.
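A minimal Python sketch of cut-off selection follows. The text does not state the exact criterion, so the sketch assumes one common reading, namely choosing the cut-off with the largest sum of sensitivity and specificity; the study may instead have read the crossing point of the two curves from Appendix 7.B. Function names and toy data are hypothetical.

```python
def sensitivity_specificity(probabilities, outcomes, cutoff):
    """Classify as churn when probability >= cutoff, then return
    (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for p, y in zip(probabilities, outcomes):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(probabilities, outcomes, candidates):
    """Candidate with the largest sensitivity + specificity (assumed criterion)."""
    def total(cutoff):
        sens, spec = sensitivity_specificity(probabilities, outcomes, cutoff)
        return sens + spec
    return max(candidates, key=total)


# Hypothetical toy data, not taken from the study.
toy_probabilities = [0.30, 0.07, 0.05, 0.20, 0.02, 0.08]
toy_outcomes = [1, 0, 0, 1, 0, 1]
toy_candidates = [0.03, 0.05, 0.08, 0.15, 0.25]
chosen = best_cutoff(toy_probabilities, toy_outcomes, toy_candidates)
print(chosen, sensitivity_specificity(toy_probabilities, toy_outcomes, chosen))
```

Because only 3.9% of the sample churned, a cut-off well below 0.5, such as the 0.05 reported above, is expected when sensitivity and specificity are balanced.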

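The probability of churn for any given tenure and invoice can be illustrated as follows. Using the estimates in Table 6, a subscriber with tenure=1 (less than 3 months) and invoice=1 (less than Rp 25,000) has an estimated probability of churn of:

$$\frac{\exp(-5.708 + 3.710 + 0.816)}{1 + \exp(-5.708 + 3.710 + 0.816)} \approx 0.235$$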

Table 6 Summary of logistic regression analysis of churn

Explanatory variables    β̂         SE(β̂)    df    Wald's χ²    θ̂ (odds ratio)
Intercept                -5.708    0.107    1     2848.500
Tenure                                      5     2344.989
  Tenure1                3.710     0.109    1     1151.502     40.842
  Tenure2                3.038     0.113    1     726.426      20.855
  Tenure3                2.147     0.121    1     313.251      8.560
  Tenure4                2.481     0.113    1     481.646      11.957
  Tenure5                0.724     0.119    1     37.098       2.063
Invoice                                     3     511.372
  Invoice1               0.816     0.097    1     70.343       2.262
  Invoice2               1.299     0.062    1     437.354      3.666
  Invoice3               0.399     0.063    1     40.470       1.491

Test                         df    χ²          p-value
Model evaluation
  Likelihood ratio (G) test  8     3873.261    <.0001
  Wald test                  8     2807.530    <.0001

Model power and classification accuracy
Area under ROC curve    Cut off    Correct rate (%)    Sensitivity (%)    Specificity (%)
0.841                   0.05       78.1                75.5               78.3

Numeric suffixes denote categories of explanatory variables.

Using a similar calculation, a subscriber with tenure=2 and invoice=1 has an estimated probability of churn of 0.135. The estimated probabilities of churn for all possible tenure categories at a given invoice category are plotted in Appendix 8.A, and those for all possible invoice categories at a given tenure category are plotted in Appendix 8.B.

According to Appendix 8.A, with invoice held constant (category 1), the estimated probability of churn decreased as the tenure category increased, except that the estimated probability at tenure=4 was slightly higher than at tenure=3. Tenure=1 had the highest estimated probability of churn, and tenure=6 the lowest. According to Appendix 8.B, with tenure held constant (category 1), the estimated probability of churn was highest when invoice=2 and lowest when invoice=4.
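The pattern described above can be reproduced from the rounded coefficients in Table 6. The following Python sketch is illustrative only; the names `churn_probability`, `by_tenure` and `by_invoice` are hypothetical, and results may differ in the last digit from values computed with unrounded estimates.

```python
import math

INTERCEPT = -5.708
# Tenure category 6 and invoice category 4 are the reference levels (coefficient 0).
TENURE = {1: 3.710, 2: 3.038, 3: 2.147, 4: 2.481, 5: 0.724, 6: 0.0}
INVOICE = {1: 0.816, 2: 1.299, 3: 0.399, 4: 0.0}


def churn_probability(tenure, invoice):
    """Estimated probability of churn for a tenure/invoice category pair."""
    logit = INTERCEPT + TENURE[tenure] + INVOICE[invoice]
    return 1.0 / (1.0 + math.exp(-logit))


# Invoice fixed at category 1 (as in Appendix 8.A).
by_tenure = {t: churn_probability(t, 1) for t in TENURE}
# Tenure fixed at category 1 (as in Appendix 8.B).
by_invoice = {i: churn_probability(1, i) for i in INVOICE}

for t, p in by_tenure.items():
    print(f"tenure={t}, invoice=1: {p:.3f}")
for i, p in by_invoice.items():
    print(f"tenure=1, invoice={i}: {p:.3f}")
```

With these coefficients, tenure=1 and invoice=1 gives about 0.235 and tenure=2 and invoice=1 gives about 0.135, in line with the values quoted in the text.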

The last column of Table 6 provides the estimated odds ratios, which can be interpreted as the ratio of the odds of churn between a given category of an explanatory variable and its reference category. In this research, the reference category was the highest category of each explanatory variable (the dummy variable matrix is provided in Appendix 5). To illustrate, the odds ratio of Tenure1=40.842 means that the odds of churn for a subscriber with tenure at category 1 (less than 3 months) were about exp(3.710)≈40.842 times the odds for a subscriber with tenure at category 6 (more than 60 months). Similarly, the odds ratio of Invoice1=2.262 means that the odds of churn for a subscriber with invoice at category 1 (less than Rp 25,000) were about exp(0.816)≈2.262 times the odds for a subscriber with invoice at category 4 (more than Rp 150,000), and so on.
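The link between a coefficient and its odds ratio can be checked numerically. The sketch below, with hypothetical variable names, compares the odds of churn for tenure category 1 and the reference category 6 at the same invoice category, and confirms that their ratio equals the exponentiated coefficient.

```python
import math

INTERCEPT = -5.708
BETA_TENURE1 = 3.710   # tenure category 1 versus reference category 6
BETA_INVOICE1 = 0.816  # invoice category 1 versus reference category 4


def probability(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def odds(p):
    return p / (1.0 - p)


# Invoice is held at category 1 in both cases.
p_tenure1 = probability(INTERCEPT + BETA_TENURE1 + BETA_INVOICE1)
p_tenure6 = probability(INTERCEPT + BETA_INVOICE1)

ratio_from_probabilities = odds(p_tenure1) / odds(p_tenure6)
ratio_from_coefficient = math.exp(BETA_TENURE1)
print(round(ratio_from_probabilities, 2), round(ratio_from_coefficient, 2))
```

The rounded coefficient 3.710 gives roughly 40.85 rather than the tabulated 40.842; the small difference presumably arises because Table 6 was computed from unrounded estimates. Note that an odds ratio of about 40 does not mean a probability 40 times larger: the corresponding probabilities here are about 0.235 and 0.007.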

Therefore, according to the estimated probabilities and odds ratios, for a given invoice category the provider has to focus more on subscribers with tenure at categories 1, 2 and 4 to keep them from churning. Furthermore, for a given tenure category, the provider has to focus more on subscribers with invoice at categories 1 and 2.

Model of ‘At Risk’

Since the logistic regression model of churn yielded satisfactory results, subscribers could be classified into an ‘at risk’ or ‘at safe’ group. This was done by fitting current invoice and tenure to the model of churn as determined before. If the estimated probability of churn was greater than or equal to the cut-off point (0.05), the subscriber was classified as ‘at risk’ (risk=1), otherwise as ‘at safe’ (risk=0). By this approach, about 2.8% of active MSISDNs were categorized as shown in Table 7, and the distributions of the sample are provided in Appendix 9.A and 9.B.
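As a sketch of this classification rule, the following Python snippet applies the 0.05 cut-off to every combination of tenure and invoice category, using the rounded coefficients from Table 6. The names `risk_flag` and `at_risk_cells` are hypothetical, and the study applied the rule to individual subscribers rather than to category combinations.

```python
import math

INTERCEPT = -5.708
# Reference levels (tenure 6, invoice 4) have coefficient 0.
TENURE = {1: 3.710, 2: 3.038, 3: 2.147, 4: 2.481, 5: 0.724, 6: 0.0}
INVOICE = {1: 0.816, 2: 1.299, 3: 0.399, 4: 0.0}
CUTOFF = 0.05


def churn_probability(tenure, invoice):
    logit = INTERCEPT + TENURE[tenure] + INVOICE[invoice]
    return 1.0 / (1.0 + math.exp(-logit))


def risk_flag(tenure, invoice, cutoff=CUTOFF):
    """Return 1 ('at risk') when the estimated churn probability reaches the cut-off."""
    return 1 if churn_probability(tenure, invoice) >= cutoff else 0


at_risk_cells = sorted(
    (t, i) for t in TENURE for i in INVOICE if risk_flag(t, i) == 1
)
print(at_risk_cells)
```

Under these rounded coefficients, every combination with tenure category 1 or 2 is flagged ‘at risk’, while no combination with tenure category 5 or 6 is.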

Table 7 Description of sample for ‘at risk’ analysis

Risk    Frequency (total)    Frequency (%)
0       22691                81
1       5387                 19

The model of ‘at risk’ was also estimated by logistic regression. In this case, the explanatory variables were feature usage. Based on a stepwise variable selection procedure, nine explanatory variables remained: voice domestic, voice international, voice VoIP, voice 3G, voice international roaming, SMS, GPRS, SMS international roaming, and data 3G usage. MMS and GPRS international roaming usage were removed (Table 8).

The G test showed that the logistic model of ‘at risk’ was significant; hence it was effective for assigning subscribers to the ‘at risk’ or ‘at safe’ group based on feature usage.

The ROC curve (Appendix 10.A), which yielded a C-statistic of 0.809, suggested that the model was satisfactory. This means that for more than 80% of all possible pairs of subscribers, one categorized as ‘at risk’ and the other as ‘at safe’, the model correctly assigned a higher probability to the one categorized as ‘at risk’. This was considered excellent classification.

Table 8 Summary of logistic regression analysis of ‘at risk’

Explanatory variables     β̂          SE(β̂)    df    Wald's χ²    θ̂ (odds ratio)
Intercept                 -10.126    0.473    1     457.632
Voice Domestic                                5     2022.337
  Voice Domestic1         3.110      0.083    1     1402.526     22.418
  Voice Domestic2         2.265      0.095    1     568.013      9.634
  Voice Domestic3         2.047      0.083    1     607.101      7.744
  Voice Domestic4         1.535      0.077    1     398.055      4.640
  Voice Domestic5         0.785      0.099    1     63.515       2.192
Voice International1      0.798      0.129    1     38.323       2.221
Voice VoIP1               0.903      0.128    1     50.014       2.467
Voice 3G1                 0.379      0.280    1     1.832        1.460
Voice Intl. Roaming1      1.525      0.241    1     40.105       4.597
SMS                                           3     606.858
  SMS1                    1.784      0.118    1     226.900      5.954
  SMS2                    1.180      0.123    1     92.856       3.255
  SMS3                    0.887      0.117    1     57.123       2.428
GPRS1                     -0.135     0.060    1     5.071        0.873
SMS Intl. Roaming1        1.640      0.220    1     55.711       5.159
Data 3G1                  0.809      0.185    1     19.065       2.245

Test                         df    χ²          p-value
Model evaluation
  Likelihood ratio (G) test  15    5833.413    <.0001
  Wald test                  15    3968.465    <.0001

Model power and classification accuracy
Area under ROC curve    Cut off    Correct rate (%)    Sensitivity (%)    Specificity (%)
0.80924                 0.180      71.7                75.8               70.7


The accuracy of the model was also measured by a classification table. According to the plot of sensitivity and specificity versus all possible cut-off points (Appendix 10.B), the optimal cut-off point was set at 0.180, which gave the model a correct classification rate of 72%, sensitivity of 76%, and specificity of 71%. This result also suggested that the model was satisfactory for classifying subscribers into the ‘at risk’ or ‘at safe’ group.

The probability of being classified ‘at risk’ for a given category of explanatory variables can be illustrated as follows. A subscriber with voice domestic=1 (less than Rp 5,000) and all other explanatory variables at category 1 has an estimated probability of being classified ‘at risk’ of:

$$\frac{\exp(-10.126 + 3.110 + \dots + 0.809)}{1 + \exp(-10.126 + 3.110 + \dots + 0.809)} \approx 0.665$$

Using a similar calculation, a subscriber with voice domestic=2 and all other explanatory variables at category 1 has an estimated probability of being classified ‘at risk’ of 0.461, and a subscriber with SMS=2 and all other explanatory variables at category 1 has an estimated probability of 0.521. The estimated probabilities of being classified ‘at risk’ for all possible voice domestic categories, with the other explanatory variables held fixed, are plotted in Appendix 11.A; those for all possible SMS categories are plotted in Appendix 11.B. The probability of being classified ‘at risk’ decreased as either the voice domestic category or the SMS category increased. By similar examination, the other explanatory variables, except GPRS, showed the same tendency. With GPRS at category 2 and the other explanatory variables at category 1, the probability of a subscriber being classified ‘at risk’ was 0.695, higher than the 0.665 obtained with GPRS at category 1. This tendency was also reflected in the negative estimated parameter and hence an odds ratio of less than one.
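The probabilities quoted above can be recomputed from the rounded coefficients in Table 8. The following Python sketch is illustrative; the dictionary keys and the function name `at_risk_probability` are hypothetical labels, and every variable defaults to category 1 unless stated otherwise.

```python
import math

INTERCEPT = -10.126
# Coefficients by category; the highest category of each variable is the
# reference level with coefficient 0.
COEFFICIENTS = {
    "voice_domestic": {1: 3.110, 2: 2.265, 3: 2.047, 4: 1.535, 5: 0.785, 6: 0.0},
    "voice_international": {1: 0.798, 2: 0.0},
    "voice_voip": {1: 0.903, 2: 0.0},
    "voice_3g": {1: 0.379, 2: 0.0},
    "voice_intl_roaming": {1: 1.525, 2: 0.0},
    "sms": {1: 1.784, 2: 1.180, 3: 0.887, 4: 0.0},
    "gprs": {1: -0.135, 2: 0.0},
    "sms_intl_roaming": {1: 1.640, 2: 0.0},
    "data_3g": {1: 0.809, 2: 0.0},
}


def at_risk_probability(**categories):
    """Estimated probability of being classified 'at risk'.

    Variables not named in `categories` are set to category 1.
    """
    unknown = set(categories) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown variables: {sorted(unknown)}")
    logit = INTERCEPT
    for name, levels in COEFFICIENTS.items():
        logit += levels[categories.get(name, 1)]
    return 1.0 / (1.0 + math.exp(-logit))


print(f"all at category 1: {at_risk_probability():.3f}")
print(f"voice domestic=2:  {at_risk_probability(voice_domestic=2):.3f}")
print(f"SMS=2:             {at_risk_probability(sms=2):.3f}")
print(f"GPRS=2:            {at_risk_probability(gprs=2):.3f}")
```

The four cases correspond to the values 0.665, 0.461, 0.521 and 0.695 given in the text, and the GPRS case shows the reversed direction caused by its negative coefficient.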

The odds ratio of Voice Domestic1=22.418 means that the odds of being classified ‘at risk’ for a subscriber who spent less than Rp 5,000 on the voice domestic feature were about 22.418 times the odds for a subscriber who spent more than Rp 150,000 on the same feature. The odds ratio of SMS1=5.954 means that the odds of being classified ‘at risk’ for a subscriber who spent less than Rp 5,000 on the SMS feature were about 5.954 times the odds for a subscriber who spent more than Rp 100,000 on the same feature. The odds ratio of SMS Intl. Roaming1=5.159 means that the odds of being classified ‘at risk’ for a subscriber who spent less than Rp 5,000 on the SMS international roaming feature were about 5.159 times the odds for a subscriber who spent more than Rp 5,000 on the same feature. Finally, the odds ratio of GPRS1=0.873 means that the odds of being classified ‘at risk’ for a subscriber who spent less than Rp 5,000 on the GPRS feature were about 0.873 times, that is, lower than, the odds for a subscriber who spent more than Rp 5,000 on the same feature.

CONCLUSION

The K-means clustering algorithm was able to classify subscribers based on values represented by tenure and invoice. All five segments (A–E) grouped homogeneous subscribers within each segment, and the segments were most likely different from each other. Segments A, B and C had low invoice but were separated by tenure, whereas segments D and E had higher invoice than A, B and C. The tenure of segments D and E was widely scattered, that is, it had a very large range.

The logistic regression model of churn based on invoice and tenure was effective in predicting the churn event. With invoice held constant, tenure category 1 had the highest estimated probability of churn; with tenure held constant, invoice category 2 had the highest estimated probability of churn. The logistic regression model of ‘at risk’ based on feature usage was able to classify subscribers into the ‘at risk’ or ‘at safe’ group. Voice domestic, voice international, voice VoIP, voice 3G, voice international roaming, SMS, SMS international roaming, and data 3G usage showed a similar tendency in the estimated probability of being classified ‘at risk’: the estimated probability decreased as the category of the explanatory variable of interest increased while the others were held constant. In contrast, the estimated probability of being classified ‘at risk’ increased as GPRS usage increased.

RECOMMENDATION

Performing segmentation with other clustering algorithms, such as K-median, Fuzzy C-Means, etc., is highly recommended. If data


More detailed data on each subscriber's billing, demographics and behavior will yield more accurate predictions from the model of churn. The model will also be better if the research accommodates trends in subscriber behavior over time and weighted variables. Other statistical and data mining tools, such as discriminant analysis, survival analysis, decision trees, or more complex algorithms such as artificial neural networks, are alternatives for building a model of churn.



Appendix 1 Description of variables for analysis

Feature    Description
MSISDN     Mobile subscriber’s ISDN, commonly called the mobile telephone number.
Tenure     Duration of subscription, calculated in days.
Invoice    Average invoice over three months (segmentation) or invoice per month (churn). Invoice is the sum of the following variables.

Voice
VoDO       Invoice for domestic voice, including local and long distance dialing.
VoIN       Invoice for international voice.
VoIP       Invoice for Voice over Internet Protocol (VoIP).
Vo3G       Invoice for voice 3G, i.e. 3G video call.

Value Added Service
SMS        Invoice for sending short text messages.
MMS        Invoice for sending multimedia messages.
GPRS       Invoice for accessing or browsing the internet via the GPRS network.
Da3G       Invoice for accessing data or browsing the internet via the 3G network.

International Roaming
RoVO       Interconnection fee when dialing to or from foreign countries.
RoSMS      Interconnection fee when sending SMS to or from foreign countries.


Appendix 2.A Histogram of tenure on standardized data (x-axis: Z value of tenure; y-axis: percent)

Appendix 2.B Histogram of invoice on standardized data (x-axis: Z value of invoice; y-axis: percent)


Appendix 3.A Segment summary

Segment    % of Subscribers    RMS       Maximum Distance from Seed to Observation    Nearest Segment    Distance Between Segments
A          39.70               0.3007    1.7411                                       B                  1.0983
B          40.16               0.3917    1.4402                                       A                  1.0983
C          12.99               0.453     1.7968                                       B                  1.7629
D          6.28                0.9116    2.8774                                       B                  2.2662
E          0.87                1.7231    6.6637                                       D                  4.8826

Appendix 3.B Mean and standard deviation of invoice and tenure by segment on standardized data

Segment    Z-value of Invoice                 Z-value of Tenure
           Mean         Standard Deviation    Mean       Standard Deviation
A          -0.370384    0.3306458             -0.8959    0.2673418
B          -0.141986    0.3939036             0.1784     0.3894284
C          0.084605     0.5580102             1.9267     0.3146814
D          2.112819     0.8808986             0.4057     0.9412675


Appendix 4 Percentage of invoice by feature and segment

Feature                         Percentage of invoice by segment
                                A        B        C        D        E
VoDO     Voice Domestic         66.26    66.69    66.67    56.49    28.48
VoIN     Voice International    1.93     1.67     3.55     5.21     12.38
VoIP     Voice VoIP             1.64     1.23     1.70     3.09     2.95
Vo3G     Voice 3G               0.22     0.10     0.04     0.05     0.02
SMS      SMS                    21.60    22.36    17.55    11.99    5.58
MMS      MMS                    0.14     0.12     0.09     0.08     0.02
GPRS     GPRS                   2.88     2.14     0.78     0.62     0.12
Da3G     Data 3G                2.35     0.00     0.00     0.60     0.00
RoVO     Voice Intl. Roaming    1.22     2.75     5.65     13.06    34.15
RoSMS    SMS Intl. Roaming      0.80     2.30     3.28     5.36     8.67


Appendix 5 Definition of categorical and dummy variables for churn and ‘at risk’ analysis

Predictor                      Category    Value 1)     Dummy variables
Tenure                         1           0 - 3        1 0 0 0 0
                               2           3 - 6        0 1 0 0 0
                               3           6 - 12       0 0 1 0 0
                               4           12 - 24      0 0 0 1 0
                               5           24 - 60      0 0 0 0 1
                               6           > 60         0 0 0 0 0
Invoice                        1           0 - 25       1 0 0
                               2           25 - 50      0 1 0
                               3           50 - 150     0 0 1
                               4           > 150        0 0 0
Voice Domestic                 1           0 - 5        1 0 0 0 0
                               2           5 - 10       0 1 0 0 0
                               3           10 - 25      0 0 1 0 0
                               4           25 - 100     0 0 0 1 0
                               5           100 - 150    0 0 0 0 1
                               6           > 150        0 0 0 0 0
Voice International            1           0 - 5        1
                               2           > 5          0
Voice VoIP                     1           0 - 5        1
                               2           > 5          0
Voice 3G                       1           0 - 5        1
                               2           > 5          0
Voice International Roaming    1           0 - 5        1
                               2           > 5          0
SMS                            1           0 - 5        1 0 0
                               2           5 - 10       0 1 0
                               3           10 - 100     0 0 1
                               4           > 100        0 0 0
MMS                            1           0 - 5        1
                               2           > 5          0
GPRS                           1           0 - 5        1
                               2           > 5          0
SMS International Roaming      1           0 - 5        1
                               2           > 5          0
GPRS International Roaming     1           0 - 5        1
                               2           > 5          0
Data 3G                        1           0 - 5        1
                               2           > 5          0

1) Tenure in months; invoice and feature usage in thousand rupiah (Rp), following the category descriptions in the text.
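The coding scheme in Appendix 5 can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, tenure is assumed to be already expressed in months and invoice in thousand rupiah, and a value exactly equal to a boundary is assigned to the higher category, which the appendix does not specify.

```python
import bisect

# Upper bounds of every category except the last, as listed in Appendix 5.
TENURE_BOUNDS = [3, 6, 12, 24, 60]   # months (assumed unit)
INVOICE_BOUNDS = [25, 50, 150]       # thousand rupiah (assumed unit)


def category(value, bounds):
    """1-based category; a value on a boundary goes to the higher category (assumption)."""
    return bisect.bisect_right(bounds, value) + 1


def dummy_variables(cat, n_categories):
    """Indicator vector with the highest category as the all-zero reference."""
    return [1 if cat == k else 0 for k in range(1, n_categories)]


def encode(value, bounds):
    """Bin a value and return its dummy variables."""
    return dummy_variables(category(value, bounds), len(bounds) + 1)


print(encode(2, TENURE_BOUNDS))     # tenure of 2 months
print(encode(72, TENURE_BOUNDS))    # tenure of 72 months (reference category)
print(encode(30, INVOICE_BOUNDS))   # invoice of Rp 30,000
```

Using the highest category as the all-zero reference is what makes each coefficient in Tables 6 and 8 a comparison against the longest tenure or the highest spending category.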


Appendix 6.A Bar chart of subscriber churn and stay by tenure


Appendix 7.A ROC curve for model of churn

Appendix 7.B Plot of sensitivity and specificity versus all possible cut off points in the model of churn (x-axis: cut off point; y-axis: sensitivity and specificity)


Appendix 8.A Probability of churn by all possible categories of tenure (x-axis: category of tenure; y-axis: probability of churn)

* Invoice was constant at category 1

Appendix 8.B Probability of churn by all possible categories of invoice (x-axis: category of invoice; y-axis: probability of churn)

* Tenure was constant at category 1


Appendix 9.A Bar chart of subscriber ‘at risk’ and ‘at safe’ by tenure and invoice (panels: Tenure, Invoice)

Appendix 9.B Bar chart of subscriber ‘at risk’ and ‘at safe’ by features usage (panels include: Voice International Usage, VoIP Usage, Voice International Roaming Usage, Voice 3G Usage, Data 3G Usage, GPRS International Roaming Usage, SMS International Roaming Usage, GPRS Usage, MMS Usage)


Appendix 10.A ROC curve for model of ‘at risk’

Appendix 10.B Plot of sensitivity and specificity versus all possible cut off points in the model of ‘at risk’ (x-axis: cut off point; y-axis: sensitivity and specificity)


Appendix 11.A Probability of being categorized ‘at risk’ by all possible categories of voice domestic usage (x-axis: category of voice domestic usage; y-axis: probability of ‘at risk’)

* The other variables were constant at category 1

Appendix 11.B Probability of being categorized ‘at risk’ by all possible categories of SMS usage (x-axis: category of SMS usage; y-axis: probability of ‘at risk’)

* The other variables were constant at category 1
