Source code: GitHub
Published on RPubs: rpubs
This data set comes from the Kaggle website, a provider of various datasets, and it describes bank churners. The raw data contains 10127 examples and 23 features, but after cleaning we are left with 7081 examples and 20 features (including the response variable).
Here I try to predict the client income category (more than $60K vs. less than $60K) using several different classification machine learning algorithms:
1. K Nearest Neighbors (KNN)
2. Naive Bayes (NB)
3. Decision Tree (DT)
4. Random Forest (RF)
Let's begin by loading the required libraries and importing the BankChurners.csv data set.
library(tidyverse)     # data wrangling & ggplot2
library(class)         # knn()
library(caret)         # createDataPartition(), confusionMatrix()
library(e1071)         # naiveBayes()
library(C50)           # C5.0()
library(randomForest)  # randomForest()
df <- read.csv("BankChurners.csv", stringsAsFactors = TRUE)
Cleaning the data set
Some records in the data set contain "Unknown" values. To train accurate models we have to remove them; although we lose some observations, that is acceptable because the data collection is large. We also remove a few columns that are not valuable for these models.
# Drop columns 1, 22 and 23, which are not useful for modeling
df <- df %>% select(-1, -22, -23)
# Remove rows with "Unknown" values
df <- df %>%
  filter(Education_Level != "Unknown",
         Marital_Status != "Unknown",
         Income_Category != "Unknown")
str(df)
## 'data.frame': 7081 obs. of 20 variables:
## $ Attrition_Flag : Factor w/ 2 levels "Attrited Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Customer_Age : int 45 49 51 40 44 37 48 56 57 48 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 2 2 2 1 2 ...
## $ Dependent_count : int 3 5 3 3 2 3 2 1 2 4 ...
## $ Education_Level : Factor w/ 7 levels "College","Doctorate",..: 4 3 3 6 3 6 3 1 3 5 ...
## $ Marital_Status : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 2 2 3 3 3 2 3 ...
## $ Income_Category : Factor w/ 6 levels "$120K +","$40K - $60K",..: 3 5 4 3 2 3 4 4 5 4 ...
## $ Card_Category : Factor w/ 4 levels "Blue","Gold",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Months_on_book : int 39 44 36 21 36 36 36 36 48 36 ...
## $ Total_Relationship_Count: int 5 6 4 5 3 5 6 3 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 1 1 2 3 6 2 2 ...
## $ Contacts_Count_12_mon : int 3 2 0 0 2 0 3 0 2 3 ...
## $ Credit_Limit : num 12691 8256 3418 4716 4010 ...
## $ Total_Revolving_Bal : int 777 864 0 0 1247 2517 1677 0 680 2362 ...
## $ Avg_Open_To_Buy : num 11914 7392 3418 4716 2763 ...
## $ Total_Amt_Chng_Q4_Q1 : num 1.33 1.54 2.59 2.17 1.38 ...
## $ Total_Trans_Amt : int 1144 1291 1887 816 1088 1350 1441 1539 1570 1671 ...
## $ Total_Trans_Ct : int 42 33 20 28 24 24 32 17 29 27 ...
## $ Total_Ct_Chng_Q4_Q1 : num 1.625 3.714 2.333 2.5 0.846 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0 0.311 0.113 0.144 0 0.279 0.078 ...
Visualizing some basic categorical features
df %>% ggplot(aes(x = Attrition_Flag, fill = Attrition_Flag)) +
geom_bar() +
scale_y_continuous(breaks = seq(0, 6000, by = 1000)) +
labs(title = "Clients of attrition",
y = "Number of clients")
Just looking at the attrition type, we can see that existing clients far outnumber attrited clients.
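To quantify this imbalance, here is a quick check (not part of the original output):
round(prop.table(table(df$Attrition_Flag)) * 100, 1)  # percentage of each attrition type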
Hmisc::describe(df[2:3])
## df[2:3]
##
## 2 Variables 7081 Observations
## --------------------------------------------------------------------------------
## Customer_Age
## n missing distinct Info Mean Gmd .05 .10
## 7081 0 45 0.999 46.35 9.139 33 36
## .25 .50 .75 .90 .95
## 41 46 52 57 60
##
## lowest : 26 27 28 29 30, highest: 66 67 68 70 73
## --------------------------------------------------------------------------------
## Gender
## n missing distinct
## 7081 0 2
##
## Value F M
## Frequency 3375 3706
## Proportion 0.477 0.523
## --------------------------------------------------------------------------------
Here we can observe that both genders are represented in nearly equal proportions, with males slightly ahead.
Looking at the clients' ages, there are clients over 70 and none younger than 26; the mean age is about 46 years.
df %>% ggplot(aes(x = Education_Level, fill = Gender)) +
geom_bar() + facet_wrap(~ Gender) +
labs(title = "Education levels according to Gender",
y = "Number of clients") +
coord_polar() +
theme_grey()
Most clients are graduates; only a few hold a doctorate or a post-graduate degree.
df %>% ggplot(aes(x = Marital_Status, fill = Dependent_count)) +
geom_bar() + facet_wrap(~ Dependent_count, nrow = 2) +
labs(title = "Marital status and their dependents",
y = "Number of clients") +
coord_flip() +
scale_y_continuous(breaks = seq(0, 1000, by = 100)) +
theme_grey()
Married clients have the largest numbers of dependents, most commonly 2 or 3.
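A simple cross-tabulation confirms this (a quick check, not shown in the original):
with(df, table(Marital_Status, Dependent_count))  # counts of dependents by marital status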
Hmisc::describe(df[8])
## df[8]
##
## 1 Variables 7081 Observations
## --------------------------------------------------------------------------------
## Card_Category
## n missing distinct
## 7081 0 4
##
## Value Blue Gold Platinum Silver
## Frequency 6598 81 11 391
## Proportion 0.932 0.011 0.002 0.055
## --------------------------------------------------------------------------------
Most clients have Blue cards, and very few have Platinum cards (a proportion of 0.002 of the 7081 clients).
Exploring and preparing the data
To make the categorical features usable by the models, we transform them all into numeric format so that they can also serve as modeling features.
We recode the data as 1 for "Existing Customer" & 0 for "Attrited Customer".
df$Attrition_Flag <- ifelse(df$Attrition_Flag == "Existing Customer", 1, 0)
We recode the data as 1 for "Male" & 0 for "Female".
df$Gender <- ifelse(df$Gender == "M", 1, 0)
We recode here as 1 - "Doctorate", 2 - "Graduate", 3 - "Post-Graduate", 4 - "High School", 5 - "College" & 0 for "Uneducated".
df$Education_Level <- ifelse(df$Education_Level == "Doctorate", 1,
                      ifelse(df$Education_Level == "Graduate", 2,
                      ifelse(df$Education_Level == "Post-Graduate", 3,
                      ifelse(df$Education_Level == "High School", 4,
                      ifelse(df$Education_Level == "College", 5, 0)))))
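An equivalent, arguably more readable way to express this mapping is dplyr::case_when() (a sketch that would replace the nested ifelse() above, producing the same coding):
df$Education_Level <- case_when(
  df$Education_Level == "Doctorate"     ~ 1,
  df$Education_Level == "Graduate"      ~ 2,
  df$Education_Level == "Post-Graduate" ~ 3,
  df$Education_Level == "High School"   ~ 4,
  df$Education_Level == "College"       ~ 5,
  TRUE                                  ~ 0   # "Uneducated"
)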
Recode here as 1 - "Married", 2 - "Single" & 0 for "Divorced".
df$Marital_Status <- ifelse(df$Marital_Status == "Married", 1,
ifelse(df$Marital_Status == "Single", 2, 0))
The response variable has five income categories. We recode them into two, according to whether the client earns more or less than $60K:
+ 0 - less than $60K
+ 1 - more than $60K
df$Income_Category <- ifelse(df$Income_Category == "$120K +", 1,
ifelse(df$Income_Category == "$80K - $120K", 1,
ifelse(df$Income_Category == "$60K - $80K", 1, 0)))
Here we recode the card categories as
+ 1 - Platinum
+ 2 - Gold
+ 3 - Silver
+ 4 - Blue
df$Card_Category <- ifelse(df$Card_Category == "Platinum", 1,
ifelse(df$Card_Category == "Gold", 2,
ifelse(df$Card_Category == "Silver", 3,4)))
Let's see how the response feature splits between clients earning more than $60K and less than $60K.
round(prop.table(table(df$Income_Category)) * 100, digits = 1)
##
## 0 1
## 59.4 40.6
Nearly 60% of the clients have an income below $60K.
Performing Machine Learning Algorithms
Splitting the data into training and testing sets.
set.seed(1)
intrain <- createDataPartition(df$Income_Category, p = 0.75, list = FALSE)
training <- df[intrain, ]
testing <- df[-intrain, ]
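As a quick check (not in the original write-up), we can confirm that the split roughly preserves the class balance:
round(prop.table(table(training$Income_Category)) * 100, 1)  # training class proportions
round(prop.table(table(testing$Income_Category)) * 100, 1)   # testing class proportions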
Transforming & normalizing the data
normalize <- function(x){
return ((x - min(x)) / (max(x) - min(x)))
}
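For example, applied to a small vector the function rescales every value into [0, 1]:
normalize(c(1, 2, 5, 10))  # 0.000 0.111 0.444 1.000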
# Column 7 (Income_Category) is the response, so it is excluded from the feature set
train_N <- as.data.frame(lapply(training[-7], normalize))
test_N <- as.data.frame(lapply(testing[-7], normalize))
train_Lab <- as.factor(training$Income_Category)
test_Lab <- as.factor(testing$Income_Category)
K Nearest Neighbors (KNN)
Finding the best K value
k_values <- c(1:100)
accuracy <- vector("double", length(k_values))
# Fit a KNN model for each candidate k and record its test-set accuracy
for (i in seq_along(k_values)) {
model <- knn(train = train_N, test = test_N, cl = train_Lab, k = i)
q <- confusionMatrix(model, test_Lab)
accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_N <- data.frame(k_values, accuracy)
Plotting the variation
ggplot(data = acc_table_N) +
geom_point(mapping = aes(x = k_values, y = accuracy), color = "red") +
geom_line(mapping = aes(x = k_values, y = accuracy))+
scale_y_continuous(breaks = seq(85, 89, by = 0.25)) +
scale_x_continuous(breaks = seq(1,100, by = 4)) +
labs(title = "Model accuracy variation with different K values")
The k values with the highest accuracy:
head(arrange(acc_table_N, desc(accuracy)))
## k_values accuracy
## 1 34 88.41808
## 2 40 88.41808
## 3 41 88.36158
## 4 35 88.24859
## 5 36 88.24859
## 6 39 88.24859
Training the model with the best K value & predicting using the knn() function.
kNN_pred <- knn(train = train_N, test = test_N,
cl = train_Lab, k = arrange(acc_table_N, desc(accuracy))[1,1])
Evaluating model performance by confusionMatrix()
confusionMatrix(kNN_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 873 10
## 1 197 690
##
## Accuracy : 0.8831
## 95% CI : (0.8672, 0.8977)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7662
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8159
## Specificity : 0.9857
## Pos Pred Value : 0.9887
## Neg Pred Value : 0.7779
## Prevalence : 0.6045
## Detection Rate : 0.4932
## Detection Prevalence : 0.4989
## Balanced Accuracy : 0.9008
##
## 'Positive' Class : 0
##
Improving model performance by standardizing the feature values instead of normalizing them.
train_S <- as.data.frame(lapply(training[-7], scale))
test_S <- as.data.frame(lapply(testing[-7], scale))
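Unlike min-max normalization, scale() centers each feature at mean 0 with standard deviation 1, which we can verify on one feature (a quick check, not in the original output):
round(c(mean(train_S$Credit_Limit), sd(train_S$Credit_Limit)), 3)  # expect 0 and 1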
Finding the best K value for the standardized features.
accuracy <- vector("double", length(k_values))
for (i in seq_along(k_values)) {
model <- knn(train = train_S, test = test_S, cl = train_Lab, k = i)
q <- confusionMatrix(model, test_Lab)
accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_S <- data.frame(k_values, accuracy)
Plotting the variation
ggplot(data = acc_table_S) +
geom_point(mapping = aes(x = k_values, y = accuracy), color = "red") +
geom_line(mapping = aes(x = k_values, y = accuracy))+
scale_y_continuous(breaks = seq(85, 89, by = 0.25)) +
scale_x_continuous(breaks = seq(1,100, by = 4)) +
labs(title = "Model accuracy variation with different K values",
subtitle = "with Standardization values")
The k values with the highest accuracy for the standardized features:
head(arrange(acc_table_S, desc(accuracy)))
## k_values accuracy
## 1 13 88.81356
## 2 11 88.64407
## 3 42 88.64407
## 4 14 88.58757
## 5 17 88.58757
## 6 30 88.53107
Training the improved model & predicting with the selected best K value.
kNN_pred_S <- knn(train = train_S, test = test_S,
cl = train_Lab, k = arrange(acc_table_S, desc(accuracy))[1,1])
Evaluating model performance
confusionMatrix(kNN_pred_S, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 919 47
## 1 151 653
##
## Accuracy : 0.8881
## 95% CI : (0.8725, 0.9024)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7719
##
## Mcnemar's Test P-Value : 2.482e-13
##
## Sensitivity : 0.8589
## Specificity : 0.9329
## Pos Pred Value : 0.9513
## Neg Pred Value : 0.8122
## Prevalence : 0.6045
## Detection Rate : 0.5192
## Detection Prevalence : 0.5458
## Balanced Accuracy : 0.8959
##
## 'Positive' Class : 0
##
Naive Bayes (NB)
Training the model using the naiveBayes() function & predicting values with the trained NB model.
nb_model <- naiveBayes(training[-7], train_Lab)
nb_pred <- predict(nb_model, testing[-7])
Evaluating model performance
confusionMatrix(nb_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 835 5
## 1 235 695
##
## Accuracy : 0.8644
## 95% CI : (0.8476, 0.88)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7317
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7804
## Specificity : 0.9929
## Pos Pred Value : 0.9940
## Neg Pred Value : 0.7473
## Prevalence : 0.6045
## Detection Rate : 0.4718
## Detection Prevalence : 0.4746
## Balanced Accuracy : 0.8866
##
## 'Positive' Class : 0
##
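One common tweak worth trying is Laplace smoothing, which guards against zero-probability categories; a hedged sketch (we did not tune this parameter in this analysis):
nb_model_lap <- naiveBayes(training[-7], train_Lab, laplace = 1)  # add-one smoothing
nb_pred_lap <- predict(nb_model_lap, testing[-7])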
Decision Tree (DT)
Training the model using the C5.0() function & predicting values with the trained DT model.
dT_model <- C5.0(training[-7], train_Lab)
dT_pred <- predict(dT_model, testing[-7])
Evaluating model performance
confusionMatrix(dT_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 918 28
## 1 152 672
##
## Accuracy : 0.8983
## 95% CI : (0.8833, 0.912)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7936
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8579
## Specificity : 0.9600
## Pos Pred Value : 0.9704
## Neg Pred Value : 0.8155
## Prevalence : 0.6045
## Detection Rate : 0.5186
## Detection Prevalence : 0.5345
## Balanced Accuracy : 0.9090
##
## 'Positive' Class : 0
##
To get more accuracy from the model we can tune the trials parameter, which sets the number of boosting iterations C5.0 performs.
Below we search for the trials value that gives the best model performance.
trials_values <- c(1:30)
dt_accuracy <- vector("double", length(trials_values))
for (i in seq_along(trials_values)) {
dT_model <- C5.0(training[-7], train_Lab, trials = i)
dT_pred <- predict(dT_model, testing[-7])
q <- confusionMatrix(dT_pred, test_Lab)
dt_accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_DT <- data.frame(trials_values, dt_accuracy)
Plotting the variation
ggplot(data = acc_table_DT) +
geom_point(mapping = aes(x = trials_values, y = dt_accuracy), color = "green", size = 6) +
geom_line(mapping = aes(x = trials_values, y = dt_accuracy))+
scale_y_continuous(breaks = seq(85, 93, by = 0.25)) +
scale_x_continuous(breaks = seq(1,30, by = 1)) +
labs(title = " DT Model accuracy variation with different trials values")
The trials values with the highest accuracy:
head(arrange(acc_table_DT, desc(dt_accuracy)))
## trials_values dt_accuracy
## 1 18 90.45198
## 2 26 90.39548
## 3 10 90.33898
## 4 3 90.28249
## 5 16 90.22599
## 6 23 90.22599
Training & predicting the improved model with the selected best trials value.
dT_model <- C5.0(training[-7], train_Lab,
trials = arrange(acc_table_DT, desc(dt_accuracy))[1,1])
dT_pred <- predict(dT_model, testing[-7])
Evaluating model performance
confusionMatrix(dT_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 933 32
## 1 137 668
##
## Accuracy : 0.9045
## 95% CI : (0.8899, 0.9178)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8054
##
## Mcnemar's Test P-Value : 1.244e-15
##
## Sensitivity : 0.8720
## Specificity : 0.9543
## Pos Pred Value : 0.9668
## Neg Pred Value : 0.8298
## Prevalence : 0.6045
## Detection Rate : 0.5271
## Detection Prevalence : 0.5452
## Balanced Accuracy : 0.9131
##
## 'Positive' Class : 0
##
Random Forest (RF)
We can build a high-accuracy model by selecting a good number of trees (here we used 500) and the best mtry value.
Finding the best mtry value by tuning the parameter:
mtry_values <- c(2:15)
rf_accuracy <- vector("double", length(mtry_values))
# Note: use mtry_values[i] rather than i, so the mtry actually fitted matches the recorded value
for (i in seq_along(mtry_values)) {
rf_model <- randomForest(training[-7], train_Lab, ntree = 500, mtry = mtry_values[i])
rf_pred <- predict(rf_model, testing[-7])
q <- confusionMatrix(rf_pred, test_Lab)
rf_accuracy[[i]] <- q[["overall"]][["Accuracy"]]*100
}
acc_table_RF <- data.frame(mtry_values, rf_accuracy)
Plotting the variation
ggplot(data = acc_table_RF) +
geom_point(mapping = aes(x = mtry_values, y = rf_accuracy), color = "cyan", size = 6) +
geom_line(mapping = aes(x = mtry_values, y = rf_accuracy))+
scale_y_continuous(breaks = seq(85, 93, by = 0.25)) +
scale_x_continuous(breaks = seq(2,15, by = 1)) +
labs(title = " RF Model accuracy variation with different mtry values")
The mtry values with the highest accuracy:
head(arrange(acc_table_RF, desc(rf_accuracy)))
## mtry_values rf_accuracy
## 1 15 90.33898
## 2 11 90.16949
## 3 8 90.11299
## 4 6 90.05650
## 5 13 90.00000
## 6 9 89.88701
Training & predicting the improved model with the selected best mtry value.
rf_model <- randomForest(training[-7], train_Lab,
ntree = 500, mtry = arrange(acc_table_RF, desc(rf_accuracy))[1,1])
rf_pred <- predict(rf_model, testing[-7])
Evaluating model performance
confusionMatrix(rf_pred, test_Lab)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 928 30
## 1 142 670
##
## Accuracy : 0.9028
## 95% CI : (0.8881, 0.9162)
## No Information Rate : 0.6045
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8022
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8673
## Specificity : 0.9571
## Pos Pred Value : 0.9687
## Neg Pred Value : 0.8251
## Prevalence : 0.6045
## Detection Rate : 0.5243
## Detection Prevalence : 0.5412
## Balanced Accuracy : 0.9122
##
## 'Positive' Class : 0
##
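As an aside, we could also inspect which features drive the random forest's predictions using its built-in importance measures (a quick sketch, not part of the original write-up):
varImpPlot(rf_model)  # plots mean decrease in Gini impurity per feature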
Finally, we applied several different machine learning algorithms to predict the income group of the clients.
Through those algorithms we can see how accuracy varies from model to model.
We can summarize the accuracies of all the models in the table below.
models <- c("KNN", "Naive Bayes", "Decision Tree", "Random Forest")
accuracies <- c(confusionMatrix(kNN_pred_S, test_Lab)[["overall"]][["Accuracy"]]*100,
confusionMatrix(nb_pred, test_Lab)[["overall"]][["Accuracy"]]*100,
confusionMatrix(dT_pred, test_Lab)[["overall"]][["Accuracy"]]*100,
confusionMatrix(rf_pred, test_Lab)[["overall"]][["Accuracy"]]*100)
(summary_acc_table <- data.frame("Models" = models, "Accuracy" = accuracies))
## Models Accuracy
## 1 KNN 88.81356
## 2 Naive Bayes 86.44068
## 3 Decision Tree 90.45198
## 4 Random Forest 90.28249
Plotting the results from the summary accuracy table.
summary_acc_table %>% ggplot(mapping = aes(x = Models, y = Accuracy)) +
geom_col(fill = c("aquamarine2", "aquamarine1",
"aquamarine4", "aquamarine3")) +
coord_cartesian(ylim = c(86, 91)) +
scale_y_continuous(breaks = seq(86, 91, by = 0.5)) +
labs(title = "Summarized accuracies of 4 algorithms") +
theme_linedraw()
Conclusion
Finally, as the result of this task, I can tell the community that the best machine learning algorithm for predicting bank churners' income category (less than vs. more than $60K) is the Decision Tree.