Telecom churn prediction with SmartPredict
I - Business problem overview
When several companies provide the same services, competition between them becomes the main problem. Customers can choose from multiple service providers and actively switch from one company to another. Besides, acquiring a new customer is more expensive than retaining an existing one. Customer retention is therefore crucial for a company and constitutes the main business goal.
Let's use the power of artificial intelligence to achieve this business objective.
The idea is to predict which customers are at high risk of churn by using SmartPredict (https://smartpredict.ai/).
II - Presentation of the project
This project focuses on prepaid customers in the telecom industry. The dataset can be found at https://www.kaggle.com/vijaysrikanth/telecom-churn-data-set-for-the-south-asian-market. The purpose of the project is to predict, with an XGBoost classifier model, the churn in September using data (features) from the three preceding months (June, July, August). The project is limited to High-Value Customers.
III - Step by step, let's solve it with a SmartPredict flowchart
Performing a machine learning project on SmartPredict consists of building a flowchart by interconnecting the appropriate modules for each step. The following figure shows the steps to perform for this project with the corresponding modules (in blue) used for each step.
III.1 - Visualizing and understanding data on SmartPredict dataset processor
After uploading our dataset to SmartPredict, we can view it with the SmartApp for Data Processing and Visualization.
We have a dataset of 99999 rows and 226 columns of three data types: Integer, Float and String. Each row represents one client with a description of their activities from June to September. These months are denoted by the suffixes "6, 7, 8, 9" or by the prefixes "jun, jul, aug, sep", respectively.
III.2 - Feature engineering with SmartPredict's modules and building the flowchart
First of all, we drag our dataset onto our workspace, since it constitutes the first module (telecom_churn_data) of our flowchart.
In this project, feature engineering with SmartPredict modules involves two main steps:
- filtering high-value customers,
- creating the target value churn.
- Filtering high-value customers with the Dataset Filter module
Let's define the High-Value Customers (HVC) as customers whose "Average recharge amount" is greater than or equal to the 70th percentile of the average recharge over the first two months (June and July), that is, at least 478.
First, the missing values of the columns involved in this calculation are replaced by the constant 0. For this we use the Simple Imputer module.
Then, with one Dataset Transformer module, we calculate the total recharge amount for the 6th and 7th months and derive the average recharge over these 2 months. We apply the following two rules:
Total recharge amount in a month = (total recharge data * average recharge amount data) + total recharge amount
If any of the data recharge columns is 0, the total recharge amount column is kept as is.
Average recharge amount = (total recharge amount in 6th month + total recharge amount in 7th month) / 2
The HVCs are filtered from this “Average recharge amount” feature with a Dataset Filter according to the rule mentioned above.
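For readers who want to check the rule outside SmartPredict, here is a minimal pandas sketch of the same logic. It is an illustration only, not what the SmartPredict modules execute: the column names (total_rech_amt_6, total_rech_data_6, av_rech_amt_data_6, and so on) and the toy values are assumptions. Note that once missing values are filled with 0, the product term becomes 0, so the "keep the total recharge amount as is" rule holds automatically.

```python
import pandas as pd

# Toy stand-in for the telecom data. Column names and values are assumptions
# made for illustration; they are not taken from the article.
df = pd.DataFrame({
    "total_rech_amt_6":   [100, 500, 0, 900, 250],
    "total_rech_data_6":  [1, None, 2, 3, None],
    "av_rech_amt_data_6": [50, None, 100, 200, None],
    "total_rech_amt_7":   [200, 400, 50, 1000, 300],
    "total_rech_data_7":  [None, 2, 1, 2, None],
    "av_rech_amt_data_7": [None, 150, 100, 250, None],
})

# Step 1: replace missing values in the recharge columns by the constant 0.
recharge_cols = list(df.columns)
df[recharge_cols] = df[recharge_cols].fillna(0)

# Step 2: total recharge amount per month =
#   (total recharge data * average recharge amount data) + total recharge amount
for month in (6, 7):
    df[f"total_amt_{month}"] = (
        df[f"total_rech_data_{month}"] * df[f"av_rech_amt_data_{month}"]
        + df[f"total_rech_amt_{month}"]
    )

# Step 3: average recharge amount over months 6 and 7.
df["avg_rech_amt_6_7"] = (df["total_amt_6"] + df["total_amt_7"]) / 2

# Step 4: keep customers at or above the 70th percentile.
threshold = df["avg_rech_amt_6_7"].quantile(0.70)
hvc = df[df["avg_rech_amt_6_7"] >= threshold]

print(f"70th percentile threshold: {threshold}")
print(hvc[["avg_rech_amt_6_7"]])
```

On the real dataset, the threshold computed this way is the value the article reports as 478; the toy data above naturally gives a different number.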
- Creating the target value "churn" with Dataset Transformer
We use the customers' data from September to define whether they have unsubscribed or not. With the Dataset Transformer, we apply the rule below to create a CHURN column, which constitutes the binary target value for our prediction.
If the total incoming and outgoing data and the 2G and 3G volumes in September are all less than or equal to 0, the churn value is 1; otherwise it is 0.
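The churn rule can be sketched in pandas as follows. This is only an illustration under assumptions: the four September column names (total_ic_mou_9, total_og_mou_9, vol_2g_mb_9, vol_3g_mb_9) and the toy values are hypothetical stand-ins for the columns used in the Dataset Transformer.

```python
import pandas as pd

# Toy September activity data. Column names are assumed for illustration.
df = pd.DataFrame({
    "total_ic_mou_9": [0.0, 12.5, 0.0],
    "total_og_mou_9": [0.0, 3.0, 0.0],
    "vol_2g_mb_9":    [0.0, 0.0, 40.0],
    "vol_3g_mb_9":    [0.0, 0.0, 0.0],
})

sept_cols = ["total_ic_mou_9", "total_og_mou_9", "vol_2g_mb_9", "vol_3g_mb_9"]

# churn = 1 only when every September activity column is <= 0, else 0.
df["churn"] = (df[sept_cols] <= 0).all(axis=1).astype(int)

print(df["churn"].tolist())
```

The third customer made no calls but still used 2G data, so they are not labeled as churned.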
III.3 - Preparing data
Clean data is essential to completing a machine learning project. So we must drop the data that is not useful, especially the data from the 9th month and the String data. The 9th-month columns were used to define the churn target, so keeping them as features would leak the answer to the model. Moreover, missing data are replaced by mean values for float data types and by median values for int data types. For these steps we use the DataFrame loader/converter module, two Simple Imputer modules and a Missing data handler.
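A rough pandas equivalent of this preparation step is sketched below, assuming hypothetical column names and a tiny toy table. It uses pandas' nullable "Int64" dtype so that an integer column can hold missing values, and it truncates the median to an integer before filling; both are choices made for this sketch, not something stated about the SmartPredict modules.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_numeric_dtype

# Toy table with float, integer, string and month-9 columns (names assumed).
df = pd.DataFrame({
    "arpu_8": [100.0, None, 300.0],
    "total_rech_num_8": pd.array([2, None, 4], dtype="Int64"),
    "last_date_of_month_8": ["8/31/2014"] * 3,
    "arpu_9": [90.0, 80.0, None],
    "sep_vbc_3g": [0.0, 1.0, 2.0],
    "churn": [0, 1, 0],
})

# Drop the September columns (suffix "_9" or prefix "sep_").
keep = [c for c in df.columns if not (c.endswith("_9") or c.startswith("sep_"))]

# Drop non-numeric (string) columns.
keep = [c for c in keep if is_numeric_dtype(df[c])]
clean = df[keep].copy()

# Impute: mean for float columns, median for integer columns.
for col in clean.columns:
    if is_float_dtype(clean[col]):
        clean[col] = clean[col].fillna(clean[col].mean())
    elif is_integer_dtype(clean[col]):
        # Sketch-level choice: truncate the median so the column stays integer.
        clean[col] = clean[col].fillna(int(clean[col].median()))

print(clean)
```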
III.4 - Splitting data
With a Generic Data Splitter module, we split our data so that the churn column is the target label and the remaining columns constitute the features for the train and test data. SmartPredict also has a Labeled data splitter module that automatically separates train data from test data with a ratio of our choice. For this project we choose a 90/10 ratio. In SmartPredict, normalization of the training data can be done with the Normalizer module.
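Outside SmartPredict, a comparable 90/10 split followed by normalization could look like the scikit-learn sketch below. The data is synthetic, and the choice of MinMaxScaler is an assumption: the article does not say which scaling the Normalizer module applies. The scaler is fitted on the training data only, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared features (X) and the churn label (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 90/10 train/test split, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```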
III.5 - Building model
XGBoost generally achieves high accuracy and adapts well to many types of data and problems, so we chose it for this project. We set up the XGBoost Classifier module as follows:
- maximum tree depth = 8
- learning rate = 0.1
- number of estimators = 200
- objective function = binary logistic
- booster = gbtree
To train the model, we use the Trainer ML models module, which takes as input the training data as well as the target label from the Labeled data splitter module. A trained model will be available and can be saved in an ItemSaver module once the flowchart has run.
Then we evaluate the model with the Evaluator for ML module, which takes as inputs the trained model and the test data. We obtain an accuracy of 0.94.
NB : A "Data Object Logger" module connected to the output of a module allows us to see logs.
Finally, we get the flowchart shown below. In another blog post, we proceed to its deployment so that we can predict whether a customer will unsubscribe or not.
Through this project, we see that SmartPredict is a powerful and easy-to-use platform for data science projects.