PROFIT PREDICTION OF USA SUPERSTORE CUSTOMERS
Explanatory Data Analysis
Feature Engineering
ML Algorithms Training
Metadata
Row ID => Unique ID for each row.
Order ID => Unique Order ID for each Customer.
Order Date => Order Date of the product.
Ship Date => Shipping Date of the Product.
Ship Mode=> Shipping Mode specified by the Customer.
Customer ID => Unique ID to identify each Customer.
Customer Name => Name of the Customer.
Segment => The segment where the Customer belongs.
Country => Country of residence of the Customer.
City => City of residence of of the Customer.
State => State of residence of the Customer.
Postal Code => Postal Code of every Customer.
Region => Region where the Customer belong.
Product ID => Unique ID of the Product.
Category => Category of the product ordered.
Sub-Category => Sub-Category of the product ordered.
Product Name => Name of the Product.
Sales => Sales of the Product.
Quantity => Quantity of the Product.
Discount => Discount provided.
Profit => Profit/Loss incurred.
Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 |
1 | 2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.5820 |
2 rows × 21 columns
Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9992 | 9993 | CA-2017-121258 | 2/26/2017 | 3/3/2017 | Standard Class | DB-13060 | Dave Brooks | Consumer | United States | Costa Mesa | ... | 92627 | West | OFF-PA-10004041 | Office Supplies | Paper | It's Hot Message Books with Stickers, 2 3/4" x 5" | 29.60 | 4 | 0.0 | 13.320 |
9993 | 9994 | CA-2017-119914 | 5/4/2017 | 5/9/2017 | Second Class | CC-12220 | Chris Cortes | Consumer | United States | Westminster | ... | 92683 | West | OFF-AP-10002684 | Office Supplies | Appliances | Acco 7-Outlet Masterpiece Power Center, Wihtou... | 243.16 | 2 | 0.0 | 72.948 |
2 rows × 21 columns
(9994, 21)
PART 1
Explanatory Data Analysis
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9994 entries, 0 to 9993 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Row ID 9994 non-null int64 1 Order ID 9994 non-null object 2 Order Date 9994 non-null object 3 Ship Date 9994 non-null object 4 Ship Mode 9994 non-null object 5 Customer ID 9994 non-null object 6 Customer Name 9994 non-null object 7 Segment 9994 non-null object 8 Country 9994 non-null object 9 City 9994 non-null object 10 State 9994 non-null object 11 Postal Code 9994 non-null int64 12 Region 9994 non-null object 13 Product ID 9994 non-null object 14 Category 9994 non-null object 15 Sub-Category 9994 non-null object 16 Product Name 9994 non-null object 17 Sales 9994 non-null float64 18 Quantity 9994 non-null int64 19 Discount 9994 non-null float64 20 Profit 9994 non-null float64 dtypes: float64(3), int64(3), object(15) memory usage: 1.0+ MB
Index(['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit'], dtype='object')
Row ID | Postal Code | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|
count | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 | 9994.000000 |
mean | 4997.500000 | 55190.379428 | 229.858001 | 3.789574 | 0.156203 | 28.656896 |
std | 2885.163629 | 32063.693350 | 623.245101 | 2.225110 | 0.206452 | 234.260108 |
min | 1.000000 | 1040.000000 | 0.444000 | 1.000000 | 0.000000 | -6599.978000 |
25% | 2499.250000 | 23223.000000 | 17.280000 | 2.000000 | 0.000000 | 1.728750 |
50% | 4997.500000 | 56430.500000 | 54.490000 | 3.000000 | 0.200000 | 8.666500 |
75% | 7495.750000 | 90008.000000 | 209.940000 | 5.000000 | 0.200000 | 29.364000 |
max | 9994.000000 | 99301.000000 | 22638.480000 | 14.000000 | 0.800000 | 8399.976000 |
Ship Mode | State | First Class | Same Day | Second Class | Standard Class |
---|---|---|---|---|---|
26 | Nevada | 392.239600 | 475.944000 | 567.149667 | 356.484000 |
31 | North Carolina | 159.149263 | 875.506929 | 248.553333 | 173.284871 |
20 | Michigan | 520.668857 | 230.780222 | 356.993813 | 262.177506 |
12 | Indiana | 194.193125 | 483.973333 | 276.516857 | 413.876421 |
34 | Oklahoma | 356.334000 | 519.794286 | 184.144286 | 276.045745 |
Sub-Category | State | Accessories | Appliances | Art | Binders | Bookcases | Chairs | Copiers | Envelopes | Fasteners | Furnishings | Labels | Machines | Paper | Phones | Storage | Supplies | Tables | total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 387.138333 | 208.160000 | 43.030000 | 46.682000 | NaN | 783.108000 | 899.970000 | 52.920000 | 3.620000 | 13.493333 | 253.245000 | 3040.000000 | 110.377500 | 451.046667 | 416.470000 | 242.255000 | 792.153333 | NaN |
1 | Arizona | 154.370909 | 129.072000 | 99.225143 | 62.438657 | 173.071000 | 406.192500 | NaN | 30.645714 | 16.872000 | 119.602286 | 9.154667 | 321.983000 | 78.734194 | 321.251130 | 123.351619 | 70.632000 | 443.927778 | NaN |
2 | Arkansas | 162.585455 | NaN | 13.050000 | 288.983000 | 638.820000 | 836.640000 | NaN | 103.900000 | 21.335000 | 114.698000 | 7.310000 | NaN | 67.126667 | 534.202500 | 215.000000 | NaN | 301.960000 | NaN |
3 | California | 253.435442 | 257.190638 | 33.307030 | 103.163652 | 529.971567 | 485.225908 | 1444.677647 | 70.376957 | 11.417381 | 104.693613 | 34.929398 | 1282.261565 | 57.985986 | 395.142651 | 281.951812 | 365.603488 | 639.862423 | NaN |
4 | Colorado | 152.532800 | 151.921778 | 19.147636 | 18.177429 | 175.947000 | 588.915429 | 439.992000 | 15.120000 | 11.130000 | 81.565333 | 8.980000 | 1104.563000 | 46.485053 | 307.791000 | 260.551385 | 280.680000 | 364.282500 | NaN |
Segment | State | Consumer | Corporate | Home Office |
---|---|---|---|---|
43 | Vermont | 75.023200 | 306.645460 | 336.635000 |
12 | Indiana | 55.153593 | 299.846849 | 58.942949 |
23 | Missouri | 30.475097 | 193.666453 | 182.550769 |
37 | Rhode Island | 41.127433 | 54.708805 | 275.991645 |
26 | Nevada | 40.759328 | 59.706515 | 225.864162 |
Ship Mode | State | First Class | Same Day | Second Class | Standard Class |
---|---|---|---|---|---|
3 | California | 1233.0 | 404.0 | 1475.0 | 4555.0 |
30 | New York | 651.0 | 250.0 | 768.0 | 2555.0 |
41 | Texas | 560.0 | 158.0 | 718.0 | 2288.0 |
36 | Pennsylvania | 399.0 | 69.0 | 353.0 | 1332.0 |
45 | Washington | 211.0 | 152.0 | 372.0 | 1148.0 |
Analysis of most customers in some categories
The West Region which was the having most customers.
The Standard Class in Ship-Mode which was the having most customers.
California State which was the having most customers.
PART 2
Feature Engineering
Ship Mode | Segment | State | Region | Category | Sub-Category | Sales | Quantity | Discount | Profit | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Second Class | Consumer | Kentucky | South | Furniture | Bookcases | 261.9600 | 2 | 0.00 | 41.9136 |
1 | Second Class | Consumer | Kentucky | South | Furniture | Chairs | 731.9400 | 3 | 0.00 | 219.5820 |
2 | Second Class | Corporate | California | West | Office Supplies | Labels | 14.6200 | 2 | 0.00 | 6.8714 |
3 | Standard Class | Consumer | Florida | South | Furniture | Tables | 957.5775 | 5 | 0.45 | -383.0310 |
4 | Standard Class | Consumer | Florida | South | Office Supplies | Storage | 22.3680 | 2 | 0.20 | 2.5164 |
Sales | Quantity | Discount | Profit | Ship Mode_Same Day | Ship Mode_Second Class | Ship Mode_Standard Class | Segment_Corporate | Segment_Home Office | State_Arizona | ... | Sub-Category_Envelopes | Sub-Category_Fasteners | Sub-Category_Furnishings | Sub-Category_Labels | Sub-Category_Machines | Sub-Category_Paper | Sub-Category_Phones | Sub-Category_Storage | Sub-Category_Supplies | Sub-Category_Tables | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.051510 | -0.804303 | -0.756643 | 0.056593 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0.805633 | -0.354865 | -0.756643 | 0.815054 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | -0.345368 | -0.804303 | -0.756643 | -0.093002 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1.167688 | 0.544012 | 1.423149 | -1.757484 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | -0.332935 | -0.804303 | 0.212153 | -0.111593 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 78 columns
PART 3
Models Training
((9994, 77), (9994,))
0.8405019729439146
Parameter Tuning
Out of this five algorithms, there are 2 algorithms which fitting the accuracies more than 80%. So out of them the Random Forest Regression was the best algorithm for the particular dataset. Let's do parameter tuning for it to see whether the acuuracy can be increased.
Hyperparameter tuning with RandomizedSerachCV
Fitting 5 folds for each of 3 candidates, totalling 15 fits Wall time: 31.7 s
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(n_jobs=-1, random_state=3), n_iter=3, param_distributions={'criterion': ['mae', 'mse'], 'max_depth': [None, 3, 5, 10], 'max_features': [0.5, 1, 'sqrt', 'auto'], 'max_samples': [3000], 'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]), 'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]), 'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])}, verbose=True)
{'n_estimators': 60, 'min_samples_split': 8, 'min_samples_leaf': 5, 'max_samples': 3000, 'max_features': 'sqrt', 'max_depth': None, 'criterion': 'mse'}
0.3510010746913461
Here it was given less than the previous accuracy value (0.840). So I had to try in random way to increase the accuracy
Conclusion
So finaly it can be fitted as follow for n_estimators value and random_satate 10.
Tuning model's Accuracy is 87.37%
0 Comments