• Home
STATISTICS RESEARCHES REPORTS

STATISTICS RESEARCHES REPORTS

STATISTICS RESEARCHES REPORTS
  • Home
  • Academic Documents
  • _Factorial Design
  • __2 k Design
  • __Confounding
  • __ANCOVA
  • __Fractional Factorial
  • _Regression Analysis
  • __SLR & MLR
  • __Models Building(Stepwise / MLR / Non-Linear)
  • _Simulation Modeling
  • _Sampling
  • _Experiment Designs
  • Projects Analysis Reports
  • _Accommodations in SL
  • _Germination of Seeds
  • __Light spectrum
  • __Light Intensity
  • _Football Players Review
  • Machine Learning
  • _Supervised Learning
  • __Heart Diseases
  • __USA superstore profit
  • __Bank Churners Income
  • __South Asian Wireless Telecom
  • Documents
  • _Machine Learning Projects Checklist

USA SUPERSTORE PROFIT PREDICTION

 

PROFIT PREDICTION OF USA SUPERSTORE CUSTOMERS

Explanatory Data Analysis

Feature Engineering

ML Algorithms Training


Metadata

Row ID => Unique ID for each row.
Order ID => Unique Order ID for each Customer.
Order Date => Order Date of the product.
Ship Date => Shipping Date of the Product.
Ship Mode=> Shipping Mode specified by the Customer.
Customer ID => Unique ID to identify each Customer.
Customer Name => Name of the Customer.
Segment => The segment where the Customer belongs.
Country => Country of residence of the Customer.
City => City of residence of of the Customer.
State => State of residence of the Customer.
Postal Code => Postal Code of every Customer.
Region => Region where the Customer belong.
Product ID => Unique ID of the Product.
Category => Category of the product ordered.
Sub-Category => Sub-Category of the product ordered.
Product Name => Name of the Product.
Sales => Sales of the Product.
Quantity => Quantity of the Product.
Discount => Discount provided.
Profit => Profit/Loss incurred.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
warnings.filterwarnings('ignore')
sns.set_theme(style='whitegrid')
In [2]:
df = pd.read_csv('Superstore.csv', encoding='unicode_escape')
#df = pd.read_csv('Superstore.csv', engine='python')
df.head(2)
Out[2]:
Row IDOrder IDOrder DateShip DateShip ModeCustomer IDCustomer NameSegmentCountryCity...Postal CodeRegionProduct IDCategorySub-CategoryProduct NameSalesQuantityDiscountProfit
01CA-2016-15215611/8/201611/11/2016Second ClassCG-12520Claire GuteConsumerUnited StatesHenderson...42420SouthFUR-BO-10001798FurnitureBookcasesBush Somerset Collection Bookcase261.9620.041.9136
12CA-2016-15215611/8/201611/11/2016Second ClassCG-12520Claire GuteConsumerUnited StatesHenderson...42420SouthFUR-CH-10000454FurnitureChairsHon Deluxe Fabric Upholstered Stacking Chairs,...731.9430.0219.5820

2 rows × 21 columns

In [3]:
df.tail(2)
Out[3]:
Row IDOrder IDOrder DateShip DateShip ModeCustomer IDCustomer NameSegmentCountryCity...Postal CodeRegionProduct IDCategorySub-CategoryProduct NameSalesQuantityDiscountProfit
99929993CA-2017-1212582/26/20173/3/2017Standard ClassDB-13060Dave BrooksConsumerUnited StatesCosta Mesa...92627WestOFF-PA-10004041Office SuppliesPaperIt's Hot Message Books with Stickers, 2 3/4" x 5"29.6040.013.320
99939994CA-2017-1199145/4/20175/9/2017Second ClassCC-12220Chris CortesConsumerUnited StatesWestminster...92683WestOFF-AP-10002684Office SuppliesAppliancesAcco 7-Outlet Masterpiece Power Center, Wihtou...243.1620.072.948

2 rows × 21 columns

In [4]:
df.shape
Out[4]:
(9994, 21)

PART 1

Explanatory Data Analysis

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 non-null   int64  
 1   Order ID       9994 non-null   object 
 2   Order Date     9994 non-null   object 
 3   Ship Date      9994 non-null   object 
 4   Ship Mode      9994 non-null   object 
 5   Customer ID    9994 non-null   object 
 6   Customer Name  9994 non-null   object 
 7   Segment        9994 non-null   object 
 8   Country        9994 non-null   object 
 9   City           9994 non-null   object 
 10  State          9994 non-null   object 
 11  Postal Code    9994 non-null   int64  
 12  Region         9994 non-null   object 
 13  Product ID     9994 non-null   object 
 14  Category       9994 non-null   object 
 15  Sub-Category   9994 non-null   object 
 16  Product Name   9994 non-null   object 
 17  Sales          9994 non-null   float64
 18  Quantity       9994 non-null   int64  
 19  Discount       9994 non-null   float64
 20  Profit         9994 non-null   float64
dtypes: float64(3), int64(3), object(15)
memory usage: 1.0+ MB
In [6]:
df.columns
Out[6]:
Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode',
       'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State',
       'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category',
       'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit'],
      dtype='object')
In [7]:
df.describe()
Out[7]:
Row IDPostal CodeSalesQuantityDiscountProfit
count9994.0000009994.0000009994.0000009994.0000009994.0000009994.000000
mean4997.50000055190.379428229.8580013.7895740.15620328.656896
std2885.16362932063.693350623.2451012.2251100.206452234.260108
min1.0000001040.0000000.4440001.0000000.000000-6599.978000
25%2499.25000023223.00000017.2800002.0000000.0000001.728750
50%4997.50000056430.50000054.4900003.0000000.2000008.666500
75%7495.75000090008.000000209.9400005.0000000.20000029.364000
max9994.00000099301.00000022638.48000014.0000000.8000008399.976000
In [8]:
categories = ['Ship Mode','Segment','State','Region', 'Category','Sub-Category']
indexrange = [1,2,3,4,5,6]
cat_class = []
y = []
x = []
title = ['Shipping Modes specified by the Customers.','The segments where the Customers belong.',
         'States of residence of the Customers.','Regions where the Customers belong.',
         'Categories of the products ordered.','Sub-Categories of the products ordered.']
for i in range(6):
    i = df[categories[i]].value_counts()
    cat_class.append(i)
plt.figure(figsize = (10,25))
for i in range(6):
    y.append(cat_class[i].index)
    x.append(cat_class[i].values)
    plt.subplot(3,2,indexrange[i])
    sns.barplot(x[i], y[i], palette='magma')
    plt.title(title[i])
In [9]:
Sales_shipdf = df.groupby(['State','Ship Mode']).aggregate({'Sales':'mean'}).reset_index()
Sales_shipdf = Sales_shipdf.pivot(index='State', columns='Ship Mode', values='Sales').reset_index()
Sales_shipdf['total'] = Sales_shipdf['First Class']+Sales_shipdf['Same Day']+Sales_shipdf['Second Class']+Sales_shipdf['Standard Class']

Sales_shipdf = Sales_shipdf.sort_values(by = 'total', ascending = False)
Sales_shipdf.drop(columns ='total', inplace=True)
Sales_shipdf.head()
Out[9]:
Ship ModeStateFirst ClassSame DaySecond ClassStandard Class
26Nevada392.239600475.944000567.149667356.484000
31North Carolina159.149263875.506929248.553333173.284871
20Michigan520.668857230.780222356.993813262.177506
12Indiana194.193125483.973333276.516857413.876421
34Oklahoma356.334000519.794286184.144286276.045745
In [10]:
Sales_shipdf.set_index('State').plot(kind = 'bar', figsize = (18,5), stacked = True)
plt.title('Mean sales of States by Ship Mode')
plt.show()
In [11]:
sub_df = df.groupby(['State','Sub-Category']).aggregate({'Sales':'mean'}).reset_index()
sub_df = sub_df.pivot(index = 'State', columns = 'Sub-Category', values = 'Sales').reset_index()
sub_df['total'] = sub_df[['Accessories','Appliances','Art','Binders','Bookcases','Chairs','Copiers','Envelopes', 'Fasteners','Furnishings','Labels','Machines','Paper','Phones','Storage','Supplies','Tables']].sum()
sub_df.head()
Out[11]:
Sub-CategoryStateAccessoriesAppliancesArtBindersBookcasesChairsCopiersEnvelopesFastenersFurnishingsLabelsMachinesPaperPhonesStorageSuppliesTablestotal
0Alabama387.138333208.16000043.03000046.682000NaN783.108000899.97000052.9200003.62000013.493333253.2450003040.000000110.377500451.046667416.470000242.255000792.153333NaN
1Arizona154.370909129.07200099.22514362.438657173.071000406.192500NaN30.64571416.872000119.6022869.154667321.98300078.734194321.251130123.35161970.632000443.927778NaN
2Arkansas162.585455NaN13.050000288.983000638.820000836.640000NaN103.90000021.335000114.6980007.310000NaN67.126667534.202500215.000000NaN301.960000NaN
3California253.435442257.19063833.307030103.163652529.971567485.2259081444.67764770.37695711.417381104.69361334.9293981282.26156557.985986395.142651281.951812365.603488639.862423NaN
4Colorado152.532800151.92177819.14763618.177429175.947000588.915429439.99200015.12000011.13000081.5653338.9800001104.56300046.485053307.791000260.551385280.680000364.282500NaN
In [12]:
sub_df = sub_df.sort_values(by = 'total', ascending = False)
sub_df.drop(columns ='total', inplace=True)
sub_df.set_index('State').plot(kind = 'bar', figsize = (14,20), stacked = True)
plt.title("Mean sales of States by Sub Categories")
plt.show()
In [13]:
pf_seg_df = df.groupby(['State','Segment']).aggregate({'Profit':'mean'}).reset_index()
pf_seg_df = pf_seg_df.pivot(index='State', columns='Segment', values='Profit').reset_index()
pf_seg_df['total'] = pf_seg_df['Consumer']+pf_seg_df['Corporate']+pf_seg_df['Home Office']

pf_seg_df = pf_seg_df.sort_values(by = 'total', ascending = False)
pf_seg_df.drop(columns ='total', inplace=True)
pf_seg_df.head()
Out[13]:
SegmentStateConsumerCorporateHome Office
43Vermont75.023200306.645460336.635000
12Indiana55.153593299.84684958.942949
23Missouri30.475097193.666453182.550769
37Rhode Island41.12743354.708805275.991645
26Nevada40.75932859.706515225.864162
In [14]:
pf_seg_df.set_index('State').plot(kind = 'bar', figsize = (18,5), stacked = True)
plt.title("Mean Profit of State by Segments")
plt.show()
In [15]:
q_r_df = df.groupby(['State','Ship Mode']).aggregate({'Quantity':'sum'}).reset_index()
q_r_df = q_r_df.pivot(index='State', columns='Ship Mode', values='Quantity').reset_index()
q_r_df['total'] = q_r_df['First Class']+q_r_df['Same Day']+q_r_df['Second Class']+q_r_df['Standard Class']


q_r_df = q_r_df.sort_values(by = 'total', ascending = False)
q_r_df.drop(columns ='total', inplace=True)
q_r_df.head()
Out[15]:
Ship ModeStateFirst ClassSame DaySecond ClassStandard Class
3California1233.0404.01475.04555.0
30New York651.0250.0768.02555.0
41Texas560.0158.0718.02288.0
36Pennsylvania399.069.0353.01332.0
45Washington211.0152.0372.01148.0
In [16]:
q_r_df.set_index('State').plot(kind = 'bar', figsize = (14,8), stacked = True)
plt.title("Total Quantity of State by Ship Mode")
plt.show()
In [17]:
plt.figure(figsize=(14,6))
region_df = df.groupby(['Region', 'Segment']).size().reset_index().rename(columns= {0:'customers'}).sort_values('customers', ascending = False)
region_df.pivot(columns = 'Region', index = 'Segment', values = 'customers')
sns.barplot(x='Region', y='customers', data=region_df, palette='magma', hue='Segment',dodge = False)
plt.title("Total customers of Region by Segment")
plt.show()
Analysis of most customers in some categories

The West Region which was the having most customers.

In [18]:
west_df = df.loc[df['Region']=='West']

west_df['Sub-Category'].value_counts().reset_index()

plt.figure(figsize=(14,6))
sns.barplot(x=west_df['Sub-Category'].value_counts().values, y=west_df['Sub-Category'].value_counts().index, palette='magma')
plt.title("Sub-Category of West Region")
plt.show()

The Standard Class in Ship-Mode which was the having most customers.

In [19]:
stndclz_df = df.loc[df['Ship Mode'] == 'Standard Class']
stndclz_df = stndclz_df[['Ship Date', 'Sales', 'Profit']].sort_values(by='Ship Date')
stndclz_df['monthyr'] = pd.to_datetime(stndclz_df['Ship Date']).dt.to_period('M')
stndclz_df = stndclz_df.groupby('monthyr').agg({'Sales':'sum', 'Profit':'sum'}).reset_index()

plt.figure(figsize=(18,8))
sns.lineplot(x = stndclz_df.index, y = 'Sales', data = stndclz_df, label = 'Sales', linewidth = 3)
sns.lineplot(x = stndclz_df.index, y = 'Profit', data = stndclz_df, label = 'Profit', linewidth = 3)

labels = stndclz_df['monthyr'].values
plt.xticks(range(1,stndclz_df.shape[0]+1), labels=labels)
plt.xticks(rotation=90)
plt.title('Monthly total Profit and Sales variation from 2014 to 2017 of Standard Class')
plt.show()

California State which was the having most customers.

In [20]:
california_df = df.loc[df['State'] == 'California']

plt.figure(figsize=(16,5))
plt.subplot(1,2,2)
sns.barplot(california_df['Sub-Category'].value_counts().values, california_df['Sub-Category'].value_counts().index, palette='magma')
plt.title('Orderd Sub-Categories of California')

plt.subplot(1,2,1)
sns.barplot(california_df['Category'].value_counts().values, california_df['Category'].value_counts().index, palette='magma')
plt.title('Orderd Categories of California')

plt.show()

PART 2

Feature Engineering

In [21]:
df_drop = df.drop(['Row ID', 'Order ID','City', 'Order Date', 'Ship Date', 'Customer ID', 'Customer Name',  'Country', 'Postal Code',  'Product ID','Product Name'], axis=1)
In [22]:
plt.figure(figsize=(10,4))
sns.heatmap(df.isnull(), yticklabels=False, cbar = False)
plt.title('Heatmap of Null values')
plt.show()
In [23]:
df_drop.head()
Out[23]:
Ship ModeSegmentStateRegionCategorySub-CategorySalesQuantityDiscountProfit
0Second ClassConsumerKentuckySouthFurnitureBookcases261.960020.0041.9136
1Second ClassConsumerKentuckySouthFurnitureChairs731.940030.00219.5820
2Second ClassCorporateCaliforniaWestOffice SuppliesLabels14.620020.006.8714
3Standard ClassConsumerFloridaSouthFurnitureTables957.577550.45-383.0310
4Standard ClassConsumerFloridaSouthOffice SuppliesStorage22.368020.202.5164
In [24]:
scaler = StandardScaler()

scalecol = ['Sales', 'Quantity', 'Discount', 'Profit']
for i in scalecol:
    df_drop[i] = scaler.fit_transform(df_drop[[i]])
In [25]:
lablecol = df_drop[['Ship Mode', 'Segment', 'State', 'Region', 'Category','Sub-Category']]
dummi = pd.get_dummies(lablecol,drop_first=True)
In [26]:
df_drop = pd.concat([df_drop,dummi], axis=1)
df_drop = df_drop.drop(['Ship Mode', 'Segment', 'State', 'Region', 'Category','Sub-Category'], axis=1)
df_drop.head()
Out[26]:
SalesQuantityDiscountProfitShip Mode_Same DayShip Mode_Second ClassShip Mode_Standard ClassSegment_CorporateSegment_Home OfficeState_Arizona...Sub-Category_EnvelopesSub-Category_FastenersSub-Category_FurnishingsSub-Category_LabelsSub-Category_MachinesSub-Category_PaperSub-Category_PhonesSub-Category_StorageSub-Category_SuppliesSub-Category_Tables
00.051510-0.804303-0.7566430.056593010000...0000000000
10.805633-0.354865-0.7566430.815054010000...0000000000
2-0.345368-0.804303-0.756643-0.093002010100...0001000000
31.1676880.5440121.423149-1.757484001000...0000000001
4-0.332935-0.8043030.212153-0.111593001000...0000000100

5 rows × 78 columns

PART 3

Models Training

In [27]:
y = df_drop['Profit']
x = df_drop.drop(['Profit'], axis=1)
x.shape, y.shape
Out[27]:
((9994, 77), (9994,))
In [28]:
x_train,x_test, y_train,y_test = train_test_split(x,y, test_size=0.2,random_state=3)
In [29]:
models = ['SVR','Ridge', 'LinearRegression', 'RandomForestRegressor', 'DecisionTreeRegressor']

svr_model = SVR(kernel='rbf').fit(x_train,y_train)
ridge_model = Ridge().fit(x_train,y_train)
lr_model = LinearRegression().fit(x_train,y_train)
rf_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(x_train,y_train)
dt_model = DecisionTreeRegressor(random_state=0).fit(x_train,y_train)

svr_pred = svr_model.predict(x_test)
ridge_pred = ridge_model.predict(x_test)
lr_pred = lr_model.predict(x_test)
rf_pred = rf_model.predict(x_test)
dt_pred = dt_model.predict(x_test)

R2_values = [r2_score(y_test,svr_pred),r2_score(y_test,ridge_pred),
             r2_score(y_test,lr_pred),r2_score(y_test,rf_pred),r2_score(y_test,dt_pred)]
In [30]:
R2_values = sorted(R2_values, reverse=True)
R2_values[0]
Out[30]:
0.8405019729439146
In [31]:
plt.figure(figsize=(14,10))
sns.barplot(y = models, x = R2_values, palette = 'magma')
plt.title("Different type of algorithms' accuracies")
plt.xlabel('R Squared value')
plt.ylabel('Algorithm')
plt.show()
Parameter Tuning

Out of this five algorithms, there are 2 algorithms which fitting the accuracies more than 80%. So out of them the Random Forest Regression was the best algorithm for the particular dataset. Let's do parameter tuning for it to see whether the acuuracy can be increased.

Hyperparameter tuning with RandomizedSerachCV
In [32]:
%%time

from sklearn.model_selection import RandomizedSearchCV

#Different RandomForestRegressor hyperparameters
rf_grid = {'n_estimators': np.arange(10,100,10),
          'max_depth': [None,3,5,10],
          'min_samples_split':np.arange(2,20,2),
          'min_samples_leaf':np.arange(1,20,2),
          'max_features':[0.5,1,'sqrt','auto'],
          'max_samples':[3000],
          'criterion':['mae','mse']}

#Instantiate RandomizedSerachcCV model
rs_model = RandomizedSearchCV(RandomForestRegressor(n_jobs = -1,
                                                   random_state=3),
                             param_distributions=rf_grid,
                             n_iter=3,
                             cv = 5,
                             verbose=True)

#fit the RandomizedSearchCV model
rs_model.fit(x, y)
Fitting 5 folds for each of 3 candidates, totalling 15 fits
Wall time: 31.7 s
Out[32]:
RandomizedSearchCV(cv=5,
                   estimator=RandomForestRegressor(n_jobs=-1, random_state=3),
                   n_iter=3,
                   param_distributions={'criterion': ['mae', 'mse'],
                                        'max_depth': [None, 3, 5, 10],
                                        'max_features': [0.5, 1, 'sqrt',
                                                         'auto'],
                                        'max_samples': [3000],
                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),
                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),
                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},
                   verbose=True)
In [33]:
#Find the best model hyperparameters
rs_model.best_params_
Out[33]:
{'n_estimators': 60,
 'min_samples_split': 8,
 'min_samples_leaf': 5,
 'max_samples': 3000,
 'max_features': 'sqrt',
 'max_depth': None,
 'criterion': 'mse'}
In [34]:
#Evaluate the RandomizedSerachCV model
rs_model.best_score_
Out[34]:
0.3510010746913461

Here it was given less than the previous accuracy value (0.840). So I had to try in random way to increase the accuracy

In [35]:
n_estimates_rndm_st = np.arange(10,100,10)
estimater_values = []

for i in n_estimates_rndm_st:
    rfr_selected_model = RandomForestRegressor(n_estimators=i, random_state=i, criterion =  'mse').fit(x_train,y_train)
    rfr_pred = rfr_selected_model.predict(x_test)
    estimater_values.append(r2_score(y_test,rfr_pred))
In [37]:
plt.figure(figsize=(15,5))

sns.lineplot(x = n_estimates_rndm_st, y = estimater_values, label = 'estimater values', linewidth = 3)
plt.title("Accuracy variation for different values of 'n_estimators' & 'random_states'")
plt.xlabel('Estimater Values')
plt.ylabel('Model Accuracy')
plt.show()
Conclusion

So finaly it can be fitted as follow for n_estimators value and random_satate 10.

In [38]:
rfr_selected_model = RandomForestRegressor(n_estimators=10, random_state=10, criterion =  'mse').fit(x_train,y_train)
rfr_pred = rfr_selected_model.predict(x_test)
print("Tuning model's Accuracy is " + str(round((r2_score(y_test,rfr_pred))* 100, 2))+'%')
Tuning model's Accuracy is 87.37%

Post a Comment

0 Comments

Followers

INTRODUCTION

Here is planned to carry out my research and analytics works on this. Furthermore I have planned to collect here my all additional statistics works and any other materials which related to my field. Statistical software where related to statistical analysis, which are SAS, MINITAB, Excel, SPSS. And also I'll gather some working projects with related to data science programming language, basically I used python(Spyder, Jupyter Notebook) and R (R-Studio) documents in here.

CONECTIONS

  • Github
  • Twitter
  • Kaggle
  • Quora
  • LinkedIn
  • Reddit
  • Facebook

Contact Form

Name

Email *

Message *

Menu Footer Widget

  • Home
  • About Me
Crafted with by TemplatesYard | Distributed by Blogger