Source code: GitHub
Published on RPubs: rpubs
This data set comes from the Kaggle website, a provider of various datasets, and it describes bank churners. The raw data contains 10127 examples and 23 features, but after cleaning we are left with 7081 examples and 20 features (including the response variable).
Here I try to predict the client income category (more than $60K vs. less than $60K) using several different classification machine learning algorithms:
1. K Nearest Neighbors (KNN)
2. Naive Bayes (NB)
3. Decision Tree (DT)
4. Random Forest (RF)
Let's begin by loading the required libraries and importing the BankChurners.csv data set.
library(tidyverse)     # data wrangling & ggplot2
library(class)         # knn()
library(caret)         # createDataPartition(), confusionMatrix()
library(e1071)         # naiveBayes()
library(C50)           # C5.0()
library(randomForest)  # randomForest()
df <- read.csv("BankChurners.csv", stringsAsFactors = TRUE)
Cleaning the data set
Some records in the data set contain "Unknown" values. To train accurate models we have to remove them; although we lose some observations, that is acceptable because the data collection is large. We also remove a few columns that are not valuable for these models.
# Drop columns 1, 22 and 23, which are not useful for modeling
df <- df %>% select(-1, -22, -23)
# Remove rows with "Unknown" values
df <- df %>%
  filter(Education_Level != "Unknown",
         Marital_Status != "Unknown",
         Income_Category != "Unknown")
str(df)
## 'data.frame': 7081 obs. of 20 variables:
## $ Attrition_Flag : Factor w/ 2 levels "Attrited Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Customer_Age : int 45 49 51 40 44 37 48 56 57 48 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 2 2 2 1 2 ...
## $ Dependent_count : int 3 5 3 3 2 3 2 1 2 4 ...
## $ Education_Level : Factor w/ 7 levels "College","Doctorate",..: 4 3 3 6 3 6 3 1 3 5 ...
## $ Marital_Status : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 2 2 3 3 3 2 3 ...
## $ Income_Category : Factor w/ 6 levels "$120K +","$40K - $60K",..: 3 5 4 3 2 3 4 4 5 4 ...
## $ Card_Category : Factor w/ 4 levels "Blue","Gold",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Months_on_book : int 39 44 36 21 36 36 36 36 48 36 ...
## $ Total_Relationship_Count: int 5 6 4 5 3 5 6 3 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 1 1 2 3 6 2 2 ...
## $ Contacts_Count_12_mon : int 3 2 0 0 2 0 3 0 2 3 ...
## $ Credit_Limit : num 12691 8256 3418 4716 4010 ...
## $ Total_Revolving_Bal : int 777 864 0 0 1247 2517 1677 0 680 2362 ...
## $ Avg_Open_To_Buy : num 11914 7392 3418 4716 2763 ...
## $ Total_Amt_Chng_Q4_Q1 : num 1.33 1.54 2.59 2.17 1.38 ...
## $ Total_Trans_Amt : int 1144 1291 1887 816 1088 1350 1441 1539 1570 1671 ...
## $ Total_Trans_Ct : int 42 33 20 28 24 24 32 17 29 27 ...
## $ Total_Ct_Chng_Q4_Q1 : num 1.625 3.714 2.333 2.5 0.846 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0 0.311 0.113 0.144 0 0.279 0.078 ...
Visualizing some basic categorical features
df %>% ggplot(aes(x = Attrition_Flag, fill = Attrition_Flag)) +
geom_bar() +
scale_y_continuous(breaks = seq(0, 6000, by = 1000)) +
labs(title = "Clients of attrition",
y = "Number of clients")
Just looking at the attrition type, we can see that existing clients far outnumber attrited clients.
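To quantify this imbalance, here is a quick check (not part of the original output):
round(prop.table(table(df$Attrition_Flag)) * 100, 1)  # percentage of each attrition type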
Hmisc::describe(df[2:3])
## df[2:3]
##
## 2 Variables 7081 Observations
## --------------------------------------------------------------------------------
## Customer_Age
## n missing distinct Info Mean Gmd .05 .10
## 7081 0 45 0.999 46.35 9.139 33 36
## .25 .50 .75 .90 .95
## 41 46 52 57 60
##
## lowest : 26 27 28 29 30, highest: 66 67 68 70 73
## --------------------------------------------------------------------------------
## Gender
## n missing distinct
## 7081 0 2
##
## Value F M
## Frequency 3375 3706
## Proportion 0.477 0.523
## --------------------------------------------------------------------------------
Here we can observe that both genders are represented in nearly equal proportions, with males slightly ahead.
Looking at the clients' ages, there are clients over 70 and none younger than 26; the mean age is about 46 years.
df %>% ggplot(aes(x = Education_Level, fill = Gender)) +
geom_bar() + facet_wrap(~ Gender) +
labs(title = "Education levels according to Gender",
y = "Number of clients") +
coord_polar() +
theme_grey()
Most clients are graduates; only a few hold a doctorate or a post-graduate degree.
df %>% ggplot(aes(x = Marital_Status, fill = Dependent_count)) +
geom_bar() + facet_wrap(~ Dependent_count, nrow = 2) +
labs(title = "Marital status and their dependents",
y = "Number of clients") +
coord_flip() +
scale_y_continuous(breaks = seq(0, 1000, by = 100)) +
theme_grey()
Married clients have the largest numbers of dependents, most commonly 2 or 3.
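A simple cross-tabulation confirms this (a quick check, not shown in the original):
with(df, table(Marital_Status, Dependent_count))  # counts of dependents by marital status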
Hmisc::describe(df[8])
## df[8]
##
## 1 Variables 7081 Observations
## --------------------------------------------------------------------------------
## Card_Category
## n missing distinct
## 7081 0 4
##
## Value Blue Gold Platinum Silver
## Frequency 6598 81 11 391
## Proportion 0.932 0.011 0.002 0.055
## --------------------------------------------------------------------------------
Most clients have Blue cards, and very few have Platinum cards (a proportion of 0.002 of the 7081 clients).
Exploring and preparing the data
To make the categorical features usable by the models, we transform them all into numeric format so that they can also serve as modeling features.
We recode the data as 1 for "Existing Customer" & 0 for "Attrited Customer".
df$Attrition_Flag <- ifelse(df$Attrition_Flag == "Existing Customer", 1, 0)
We recode the data as 1 for "Male" & 0 for "Female".
df$Gender <- ifelse(df$Gender == "M", 1, 0)
We recode here as 1 - "Doctorate", 2 - "Graduate", 3 - "Post-Graduate", 4 - "High School", 5 - "College" & 0 for "Uneducated".
df$Education_Level <- ifelse(df$Education_Level == "Doctorate", 1,
                      ifelse(df$Education_Level == "Graduate", 2,
                      ifelse(df$Education_Level == "Post-Graduate", 3,
                      ifelse(df$Education_Level == "High School", 4,
                      ifelse(df$Education_Level == "College", 5, 0)))))
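An equivalent, arguably more readable way to express this mapping is dplyr::case_when() (a sketch that would replace the nested ifelse() above, producing the same coding):
df$Education_Level <- case_when(
  df$Education_Level == "Doctorate"     ~ 1,
  df$Education_Level == "Graduate"      ~ 2,
  df$Education_Level == "Post-Graduate" ~ 3,
  df$Education_Level == "High School"   ~ 4,
  df$Education_Level == "College"       ~ 5,
  TRUE                                  ~ 0   # "Uneducated"
)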
Recode here as 1 - "Married", 2 - "Single" & 0 for "Divorced".
df$Marital_Status <- ifelse(df$Marital_Status == "Married", 1,
ifelse(df$Marital_Status == "Single", 2, 0))
The response variable has five income categories. We recode them into two, according to whether the client earns more or less than $60K:
+ 0 - less than $60K
+ 1 - more than $60K
df$Income_Category <- ifelse(df$Income_Category == "$120K +", 1,
ifelse(df$Income_Category == "$80K - $120K", 1,
ifelse(df$Income_Category == "$60K - $80K", 1, 0)))
Here we recode the card categories as
+ 1 - Platinum
+ 2 - Gold
+ 3 - Silver
+ 4 - Blue
df$Card_Category <- ifelse(df$Card_Category == "Platinum", 1,
ifelse(df$Card_Category == "Gold", 2,
ifelse(df$Card_Category == "Silver", 3,4)))
Let's see how the response feature splits between clients earning more than $60K and less than $60K.
round(prop.table(table(df$Income_Category)) * 100, digits = 1)
##
## 0 1
## 59.4 40.6
Nearly 60% of the clients have an income below $60K.
Performing Machine Learning Algorithms
Splitting the data into training and testing sets.
set.seed(1)
intrain <- createDataPartition(df$Income_Category, p = 0.75, list = FALSE)
training <- df[intrain, ]
testing <- df[-intrain, ]
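As a quick check (not in the original write-up), we can confirm that the split roughly preserves the class balance:
round(prop.table(table(training$Income_Category)) * 100, 1)  # training class proportions
round(prop.table(table(testing$Income_Category)) * 100, 1)   # testing class proportions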
Transforming & normalizing the data
normalize <- function(x){
return ((x - min(x)) / (max(x) - min(x)))
}
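For example, applied to a small vector the function rescales every value into [0, 1]:
normalize(c(1, 2, 5, 10))  # 0.000 0.111 0.444 1.000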
# Column 7 (Income_Category) is the response, so it is excluded from the feature set
train_N <- as.data.frame(lapply(training[-7], normalize))
test_N <- as.data.frame(lapply(testing[-7], normalize))
train_Lab <- as.factor(training$Income_Category)
test_Lab <- as.factor(testing$Income_Category)
K Nearest Neighbors (KNN)
Finding the best K value
k_values <- c(1:100)
accuracy <- vector("double", length(k_values))
# Fit a KNN model for each candidate k and record its test-set accuracy
for (i in seq_along(k_values)) {
model <- knn(train = train_N, test = test_N, cl = train_Lab, k = i)
q <- confusionMatrix(model, test_Lab)
accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_N <- data.frame(k_values, accuracy)
Plotting the variation
ggplot(data = acc_table_N) +
geom_point(mapping = aes(x = k_values, y = accuracy), color = "red") +
geom_line(mapping = aes(x = k_values, y = accuracy))+
scale_y_continuous(breaks = seq(85, 89, by = 0.25)) +
scale_x_continuous(breaks = seq(1,100, by = 4)) +
labs(title = "Model accuracy variation with different K values")
The k values with the highest accuracy:
head(arrange(acc_table_N, desc(accuracy)))
## k_values accuracy
## 1 34 88.41808
## 2 40 88.41808
## 3 41 88.36158
## 4 35 88.24859
## 5 36 88.24859
## 6 39 88.24859
Training the model with the best K value & predicting using the knn() function.
kNN_pred <- knn(train = train_N, test = test_N,
cl = train_Lab, k = arrange(acc_table_N, desc(accuracy))[1,1])
Evaluating model performance by confusionMatrix()
confusionMatrix(kNN_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 873 10
## 1 197 690
##
## Accuracy : 0.8831
## 95% CI : (0.8672, 0.8977)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7662
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8159
## Specificity : 0.9857
## Pos Pred Value : 0.9887
## Neg Pred Value : 0.7779
## Prevalence : 0.6045
## Detection Rate : 0.4932
## Detection Prevalence : 0.4989
## Balanced Accuracy : 0.9008
##
## 'Positive' Class : 0
##
Improving model performance by standardizing the feature values instead of normalizing them.
train_S <- as.data.frame(lapply(training[-7], scale))
test_S <- as.data.frame(lapply(testing[-7], scale))
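Unlike min-max normalization, scale() centers each feature at mean 0 with standard deviation 1, which we can verify on one feature (a quick check, not in the original output):
round(c(mean(train_S$Credit_Limit), sd(train_S$Credit_Limit)), 3)  # expect 0 and 1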
Finding the best K value for the standardized features.
accuracy <- vector("double", length(k_values))
for (i in seq_along(k_values)) {
model <- knn(train = train_S, test = test_S, cl = train_Lab, k = i)
q <- confusionMatrix(model, test_Lab)
accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_S <- data.frame(k_values, accuracy)
Plotting the variation
ggplot(data = acc_table_S) +
geom_point(mapping = aes(x = k_values, y = accuracy), color = "red") +
geom_line(mapping = aes(x = k_values, y = accuracy))+
scale_y_continuous(breaks = seq(85, 89, by = 0.25)) +
scale_x_continuous(breaks = seq(1,100, by = 4)) +
labs(title = "Model accuracy variation with different K values",
subtitle = "with Standardization values")
The k values with the highest accuracy for the standardized features:
head(arrange(acc_table_S, desc(accuracy)))
## k_values accuracy
## 1 13 88.81356
## 2 11 88.64407
## 3 42 88.64407
## 4 14 88.58757
## 5 17 88.58757
## 6 30 88.53107
Training the improved model & predicting with the selected best K value.
kNN_pred_S <- knn(train = train_S, test = test_S,
cl = train_Lab, k = arrange(acc_table_S, desc(accuracy))[1,1])
Evaluating model performance
confusionMatrix(kNN_pred_S, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 919 47
## 1 151 653
##
## Accuracy : 0.8881
## 95% CI : (0.8725, 0.9024)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7719
##
## Mcnemar's Test P-Value : 2.482e-13
##
## Sensitivity : 0.8589
## Specificity : 0.9329
## Pos Pred Value : 0.9513
## Neg Pred Value : 0.8122
## Prevalence : 0.6045
## Detection Rate : 0.5192
## Detection Prevalence : 0.5458
## Balanced Accuracy : 0.8959
##
## 'Positive' Class : 0
##
Naive Bayes (NB)
Training the model using the naiveBayes() function & predicting values with the trained NB model.
nb_model <- naiveBayes(training[-7], train_Lab)
nb_pred <- predict(nb_model, testing[-7])
Evaluating model performance
confusionMatrix(nb_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 835 5
## 1 235 695
##
## Accuracy : 0.8644
## 95% CI : (0.8476, 0.88)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7317
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7804
## Specificity : 0.9929
## Pos Pred Value : 0.9940
## Neg Pred Value : 0.7473
## Prevalence : 0.6045
## Detection Rate : 0.4718
## Detection Prevalence : 0.4746
## Balanced Accuracy : 0.8866
##
## 'Positive' Class : 0
##
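One common tweak worth trying is Laplace smoothing, which guards against zero-probability categories; a hedged sketch (we did not tune this parameter in this analysis):
nb_model_lap <- naiveBayes(training[-7], train_Lab, laplace = 1)  # add-one smoothing
nb_pred_lap <- predict(nb_model_lap, testing[-7])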
Decision Tree (DT)
Training the model using the C5.0() function & predicting values with the trained DT model.
dT_model <- C5.0(training[-7], train_Lab)
dT_pred <- predict(dT_model, testing[-7])
Evaluating model performance
confusionMatrix(dT_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 918 28
## 1 152 672
##
## Accuracy : 0.8983
## 95% CI : (0.8833, 0.912)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7936
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8579
## Specificity : 0.9600
## Pos Pred Value : 0.9704
## Neg Pred Value : 0.8155
## Prevalence : 0.6045
## Detection Rate : 0.5186
## Detection Prevalence : 0.5345
## Balanced Accuracy : 0.9090
##
## 'Positive' Class : 0
##
To get more accuracy from the model we can tune the trials parameter, which sets the number of boosting iterations C5.0 performs.
Below we search for the trials value that gives the best model performance.
trials_values <- c(1:30)
dt_accuracy <- vector("double", length(trials_values))
for (i in seq_along(trials_values)) {
dT_model <- C5.0(training[-7], train_Lab, trials = i)
dT_pred <- predict(dT_model, testing[-7])
q <- confusionMatrix(dT_pred, test_Lab)
dt_accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_DT <- data.frame(trials_values, dt_accuracy)
Plotting the variation
ggplot(data = acc_table_DT) +
geom_point(mapping = aes(x = trials_values, y = dt_accuracy), color = "green", size = 6) +
geom_line(mapping = aes(x = trials_values, y = dt_accuracy))+
scale_y_continuous(breaks = seq(85, 93, by = 0.25)) +
scale_x_continuous(breaks = seq(1,30, by = 1)) +
labs(title = " DT Model accuracy variation with different trials values")
The trials values with the highest accuracy:
head(arrange(acc_table_DT, desc(dt_accuracy)))
## trials_values dt_accuracy
## 1 18 90.45198
## 2 26 90.39548
## 3 10 90.33898
## 4 3 90.28249
## 5 16 90.22599
## 6 23 90.22599
Training & predicting the improved model with the selected best trials value.
dT_model <- C5.0(training[-7], train_Lab,
trials = arrange(acc_table_DT, desc(dt_accuracy))[1,1])
dT_pred <- predict(dT_model, testing[-7])
Evaluating model performance
confusionMatrix(dT_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 933 32
## 1 137 668
##
## Accuracy : 0.9045
## 95% CI : (0.8899, 0.9178)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8054
##
## Mcnemar's Test P-Value : 1.244e-15
##
## Sensitivity : 0.8720
## Specificity : 0.9543
## Pos Pred Value : 0.9668
## Neg Pred Value : 0.8298
## Prevalence : 0.6045
## Detection Rate : 0.5271
## Detection Prevalence : 0.5452
## Balanced Accuracy : 0.9131
##
## 'Positive' Class : 0
##
Random Forest (RF)
We can build a high-accuracy model by selecting a good number of trees (here we used 500) and the best mtry value.
Finding the best mtry value by tuning the parameter:
mtry_values <- c(2:15)
rf_accuracy <- vector("double", length(mtry_values))
# Note: use mtry_values[i] rather than i, so the mtry actually fitted matches the recorded value
for (i in seq_along(mtry_values)) {
rf_model <- randomForest(training[-7], train_Lab, ntree = 500, mtry = mtry_values[i])
rf_pred <- predict(rf_model, testing[-7])
q <- confusionMatrix(rf_pred, test_Lab)
rf_accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_RF <- data.frame(mtry_values, rf_accuracy)
Plotting the variation
ggplot(data = acc_table_RF) +
geom_point(mapping = aes(x = mtry_values, y = rf_accuracy), color = "cyan", size = 6) +
geom_line(mapping = aes(x = mtry_values, y = rf_accuracy))+
scale_y_continuous(breaks = seq(85, 93, by = 0.25)) +
scale_x_continuous(breaks = seq(2,15, by = 1)) +
labs(title = " RF Model accuracy variation with different mtry values")
The mtry values with the highest accuracy:
head(arrange(acc_table_RF, desc(rf_accuracy)))
## mtry_values rf_accuracy
## 1 15 90.33898
## 2 11 90.16949
## 3 8 90.11299
## 4 6 90.05650
## 5 13 90.00000
## 6 9 89.88701
Training & predicting the improved model with the selected best mtry value.
rf_model <- randomForest(training[-7], train_Lab,
ntree = 500, mtry = arrange(acc_table_RF, desc(rf_accuracy))[1,1])
rf_pred <- predict(rf_model, testing[-7])
Evaluating model performance
confusionMatrix(rf_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 928 30
## 1 142 670
##
## Accuracy : 0.9028
## 95% CI : (0.8881, 0.9162)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8022
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8673
## Specificity : 0.9571
## Pos Pred Value : 0.9687
## Neg Pred Value : 0.8251
## Prevalence : 0.6045
## Detection Rate : 0.5243
## Detection Prevalence : 0.5412
## Balanced Accuracy : 0.9122
##
## 'Positive' Class : 0
##
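As an aside, we could also inspect which features drive the random forest's predictions using its built-in importance measures (a quick sketch, not part of the original write-up):
varImpPlot(rf_model)  # plots mean decrease in Gini impurity per feature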
Finally, we applied several different machine learning algorithms to predict the income group of the clients.
Through those algorithms we can see how accuracy varies from model to model.
We can summarize the accuracies of all the models in the table below.
models <- c("KNN", "Naive Bayes", "Decision Tree", "Random Forest")
accuracies <- c(confusionMatrix(kNN_pred_S, test_Lab)[["overall"]][["Accuracy"]]*100,
confusionMatrix(nb_pred, test_Lab)[["overall"]][["Accuracy"]]*100,
confusionMatrix(dT_pred, test_Lab)[["overall"]][["Accuracy"]]*100,
confusionMatrix(rf_pred, test_Lab)[["overall"]][["Accuracy"]]*100)
(summary_acc_table <- data.frame("Models" = models, "Accuracy" = accuracies))
## Models Accuracy
## 1 KNN 88.81356
## 2 Naive Bayes 86.44068
## 3 Decision Tree 90.45198
## 4 Random Forest 90.28249
Plotting the results from the summary accuracy table.
summary_acc_table %>% ggplot(mapping = aes(x = Models, y = Accuracy)) +
geom_col(fill = c("aquamarine2", "aquamarine1",
"aquamarine4", "aquamarine3")) +
coord_cartesian(ylim = c(86, 91)) +
scale_y_continuous(breaks = seq(86, 91, by = 0.5)) +
labs(title = "Summarized accuracies of 4 algorithms") +
theme_linedraw()
Conclusion
Finally, as the result of this task, I can tell the community that the best machine learning algorithm for predicting bank churners' income category (less than vs. more than $60K) is the Decision Tree.