Saving the Sale: Analyzing Cancellation Trends at The Look E-commerce using R

Originally posted on medium

Cancelations can be a silent killer for e-commerce businesses, eroding revenue and damaging customer relationships. After analyse the success purchase on the The Look E-commerce dataset on this medium article, now we are going to analyse why users canceling their purchase.

This article aims to uncover the underlying factors contributing to order cancellations at The Look E-commerce and provide insights for potential improvements

Note : The data used in this analysis is fictional data and was cerated by Looker team for exploration purpose. While the specific findings may not be directly applicable to real-world scenarios, the methodologies and insights presented can serve as a valuable guide for exploring e-commerce datasets and deriving actionable recommendations.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


-- SQL query to get the data for analysis
SELECT
  o.*,
  u.age,
  u.country,
  u.latitude AS user_lat,
  u.longitude AS user_lon,
  u.traffic_source,
  oi.sale_price AS sale_price,
  inv.product_retail_price,
  inv.product_category,
  inv.product_name,
  inv.product_brand,
  inv.product_department,
  dc.latitude AS dc_lat,
  dc.longitude AS dc_lon,
  dc.name AS dc_name 
FROM  `bigquery-public-data.thelook_ecommerce.orders` o
LEFT JOIN `bigquery-public-data.thelook_ecommerce.users` u ON o.user_id = u.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.order_items` oi ON o.order_id = oi.order_id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.inventory_items` inv ON oi.inventory_item_id = inv.id
LEFT JOIN `bigquery-public-data.thelook_ecommerce.distribution_centers` dc ON inv.product_distribution_center_id = dc.id
;

After querying data from Biguery, download the data for analysis in R.

1
2


dat <- read.csv('~/Desktop/Project/hugo-blog/static/data/thelook_ecommerce_dataset.csv') 
head(dat)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35


##   order_id user_id    status gender              created_at returned_at
## 1      202     161 Cancelled      F 2022-02-22 03:58:00 UTC            
## 2      241     195 Cancelled      F 2022-08-26 18:54:00 UTC            
## 3      267     219 Cancelled      F 2022-07-17 08:09:00 UTC            
## 4      285     233 Cancelled      F 2020-08-19 05:21:00 UTC            
## 5      424     346 Cancelled      F 2023-12-04 11:51:00 UTC            
## 6      601     490 Cancelled      F 2024-06-21 04:30:00 UTC            
##   shipped_at delivered_at num_of_item age       country user_lat    user_lon
## 1                                   1  22        France 48.88459    2.509238
## 2                                   1  38 United States 35.97947 -118.887492
## 3                                   1  12         China 23.00895  113.196905
## 4                                   2  67 United States 30.66803  -87.194823
## 5                                   4  43        France 43.31911    5.988499
## 6                                   1  53         China 37.74177  115.729271
##   traffic_source sale_price product_retail_price product_category
## 1        Display     150.00               150.00          Dresses
## 2         Search      13.00                13.00             Plus
## 3         Search      35.00                35.00        Intimates
## 4       Facebook      50.00                50.00   Pants & Capris
## 5        Organic       8.99                 8.99   Sleep & Lounge
## 6         Search      17.00                17.00  Socks & Hosiery
##                                                             product_name
## 1                       Design History Women's Zig Zag Sequin Mini Dress
## 2                                         Ozone Women's Mona Linen Socks
## 3                     Playtex Women's Secrets Perfectly Smooth Underwire
## 4                                Misses Unionbay Boot Cut Corduroy Pants
## 5 Cami Sets Five Color Choices Sizes (S M L) Up2date Fashion Style#CamRG
## 6                              K. Bell Socks Women's 3 Button Leg Warmer
##     product_brand product_department dc_lat   dc_lon        dc_name
## 1  Design History              Women  29.95 -90.0667 New Orleans LA
## 2           Ozone              Women  29.95 -90.0667 New Orleans LA
## 3         Playtex              Women  29.95 -90.0667 New Orleans LA
## 4        UNIONBAY              Women  29.95 -90.0667 New Orleans LA
## 5 Up2date Fashion              Women  29.95 -90.0667 New Orleans LA
## 6         K. Bell              Women  29.95 -90.0667 New Orleans LA

Data Exploration

To begin our analysis, we examined the overall cancellation rate within the dataset. This provided a baseline understanding of the problem’s magnitude and its potential impact on the business. By calculating the percentage of cancelled orders and their associated revenue loss, we established a clear picture of the issue at hand.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


order_trend <- dat %>%
  mutate(
    year_month = floor_date(as_date(created_at), "month")) %>%
      group_by(year_month) %>%
      summarize(total_order = n_distinct(order_id),
                cancel_purchase = n_distinct(order_id[status=='Cancelled']),
                cancelation_rate = round(cancel_purchase / total_order * 100,2),
                amount = sum(sale_price * num_of_item),
                amount_cancel = sum(num_of_item[status=='Cancelled']*sale_price[status=='Cancelled'])
  )

summary(order_trend)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


##    year_month          total_order     cancel_purchase   cancelation_rate
##  Min.   :2019-01-01   Min.   :  14.0   Min.   :   1.00   Min.   : 7.14   
##  1st Qu.:2020-05-24   1st Qu.: 528.2   1st Qu.:  78.25   1st Qu.:14.34   
##  Median :2021-10-16   Median :1273.0   Median : 192.00   Median :14.94   
##  Mean   :2021-10-16   Mean   :1832.3   Mean   : 275.07   Mean   :14.97   
##  3rd Qu.:2023-03-08   3rd Qu.:2555.2   3rd Qu.: 366.00   3rd Qu.:15.51   
##  Max.   :2024-08-01   Max.   :8668.0   Max.   :1360.00   Max.   :27.17   
##      amount        amount_cancel     
##  Min.   :   3296   Min.   :   824.7  
##  1st Qu.:  82898   1st Qu.: 12280.2  
##  Median : 222120   Median : 32660.7  
##  Mean   : 300877   Mean   : 44802.4  
##  3rd Qu.: 405396   3rd Qu.: 59198.4  
##  Max.   :1420433   Max.   :224732.5

The analysis reveals an average monthly cancellation rate of 14.97%, resulting in approximate revenue losses of $44,802. To understand the trend of cancellations over time, we examined order data by month.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55


ggplot(order_trend) +
  geom_col(aes(x = year_month, y = `total_order`, stat="identity", fill="# of Order")) +
  geom_line(aes(x = year_month, y = `cancelation_rate` * 100, color="Cancelation Rate",  group = 1),
  ) +
  scale_color_manual(values="black") +
  labs(title = "Purchase and Cancelation Rate Trend", x = "Month", y = "#of Order") +
  scale_y_continuous(sec.axis = sec_axis(~ . / 10000, name = "Cancelation Rate", labels = scales::percent)) +
  annotate(
    'text',
    x = as.Date("2023-03-01"),
    y = 7000,
    label = 'the number of order trend looks \nupward overtime',
    fontface = 'italic',
    size = 2.5
  ) +
  annotate(
    'curve',
    x = as.Date("2023-05-01"),
    y = 6600,
    yend = 5000,
    xend = as.Date("2024-01-01"),
    linewidth = 0.5,
    curvature = 0.5,
    arrow = arrow(length = unit(0.2, 'cm'))
  ) +
   annotate(
    'text',
    x = as.Date("2020-01-01"),
    y = 3300,
    label = 'the trend of cancellation rates \ntends to be constant overtime',
    fontface = 'italic',
    size = 2.5
  ) +
  annotate(
    'curve',
    x = as.Date("2020-10-01"), 
    y = 3100,
    yend = 2000,
    xend = as.Date("2021-05-01"),
    linewidth = 0.5,
    curvature = -0.5,
    arrow = arrow(length = unit(0.1, 'cm'))
  ) +
  theme_classic() +
  theme(legend.key=element_blank(),
          legend.title=element_blank(),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          legend.box="horizontal",
          legend.position= c(.5,.95),
          legend.text=element_text(size=9),
          plot.title = element_text(color="black", size=14, face="bold.italic", hjust=0.5),
          axis.title.x = element_text(color="darkgrey", size=9),
          axis.title.y = element_text(color="darkgrey", size=9) 
        )

Despite an upward trend in overall order volume, the cancellation rate has remained consistent. This indicates a persistent issue that warrants further investigation. To identify potential causes, we will examine the relations between order status with some variables.

Data Analysis

To enhance our analysis, we conducted feature engineering to create new variables.

is_cancel : A binary indicator was created, assigning 1 to cancelled orders and 0 to completed orders.
age_group : Customer ages were grouped into specific intervals to identify potential age-related patterns in cancellations.
deliver_to : Given the US-based distribution center, a binary variable was created to distinguish between domestic and international orders, exploring potential shipping-related impacts on cancellations.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


order_dat <- dat %>%
  filter(status!="Processing")%>%
  group_by(order_id) %>%
  summarise(age=mean(age),
            gender=gender[[1]],
            product_department = product_department[[1]],
            traffic_source = traffic_source[[1]],
          total_item = sum(num_of_item),
          total_amount = sum(num_of_item * sale_price),
          is_cancel = ifelse(status == 'Cancelled', 1, 0),
          deliver_to = ifelse(country == "United States", 1, 0) # 0 if deliver to outside US
          ) %>% 
  ungroup %>% distinct(order_id, .keep_all = TRUE)
#change categorical data to factor

col <- c("gender","product_department", "traffic_source", "is_cancel", "deliver_to")

order_dat %>% mutate_at(col, as.factor)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


## # A tibble: 99,581 × 9
##    order_id   age gender product_department traffic_source total_item
##       <int> <dbl> <fct>  <fct>              <fct>               <int>
##  1        2    41 M      Men                Search                  1
##  2        3    41 M      Men                Search                  1
##  3        4    41 M      Men                Search                  1
##  4        5    55 F      Women              Search                  1
##  5        6    31 M      Men                Search                  1
##  6        7    68 F      Women              Facebook                1
##  7        8    33 M      Men                Facebook               16
##  8       10    70 M      Men                Display                 1
##  9       12    15 F      Women              Search                  1
## 10       13    15 F      Women              Search                  1
## # ℹ 99,571 more rows
## # ℹ 3 more variables: total_amount <dbl>, is_cancel <fct>, deliver_to <fct>

1

head(order_dat)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


## # A tibble: 6 × 9
##   order_id   age gender product_department traffic_source total_item
##      <int> <dbl> <chr>  <chr>              <chr>               <int>
## 1        2    41 M      Men                Search                  1
## 2        3    41 M      Men                Search                  1
## 3        4    41 M      Men                Search                  1
## 4        5    55 F      Women              Search                  1
## 5        6    31 M      Men                Search                  1
## 6        7    68 F      Women              Facebook                1
## # ℹ 3 more variables: total_amount <dbl>, is_cancel <dbl>, deliver_to <dbl>

Analyse numerical variables vs order status

To understand how numerical variables influence order cancellations, we examined age, total items purchased, and total purchase amount. Before conducting correlation tests, we assessed the normality of these variables using the Kolmogorov-Smirnov test. This test was chosen over Shapiro-Wilk due to its suitability for larger datasets.

H0 : Data follow normal distribution

H1 : Data doesn’t follow distribution

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


#normality check

norm_test <- function(variable) {
  ks_test <- ks.test(order_dat[[variable]], y='pnorm') #set y to pnorm since we want test our variables toward normal distribution
  res <- paste0("data :", variable, "  ", "p_value :", ks_test$p.value)
  return(res)
}

#Select only numerical variables
numeric_vars <- c("total_item", "age", "total_amount")

# Calculate the correlation matrix
norm_test_results <- lapply(numeric_vars, norm_test)

names(norm_test_results) <- norm_test_results
norm_test_results 

1
2
3
4
5
6
7
8


## $`data :total_item  p_value :0`
## [1] "data :total_item  p_value :0"
## 
## $`data :age  p_value :0`
## [1] "data :age  p_value :0"
## 
## $`data :total_amount  p_value :0`
## [1] "data :total_amount  p_value :0"

The Kolmogorov-Smirnov test revealed that all three numerical variables (age, total items, and total amount) deviate significantly from a normal distribution (p-value < 0.005). As a result, non-parametric tests are required. To assess differences in these variables between successful and cancelled orders, we employed the Wilcoxon rank-sum test.

H0 : There is no difference (in terms of central tendency) between the two groups in the population.

H1 : There is difference (in terms of central tendency) between the two groups in the population.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


library(ltm)

#function for wilcoxon test
wilcox_test <- function(variable) {
  wilcox <- wilcox.test(order_dat[[variable]][which(order_dat$is_cancel == 1)], order_dat[[variable]][which(order_dat$is_cancel == 0)])
  res <- paste0("data :", variable, "  ", "p_value :", wilcox$p.value)
  return(res)
}

# Calculate the correlation
wilcox_test_results <- lapply(numeric_vars, wilcox_test)

wilcox_test_results

1
2
3
4
5
6
7
8


## [[1]]
## [1] "data :total_item  p_value :0.434329795381456"
## 
## [[2]]
## [1] "data :age  p_value :0.239814229029068"
## 
## [[3]]
## [1] "data :total_amount  p_value :0.734469519974685"

With p-values greater than 0.05 for all numerical variables, we fail to reject the null hypothesis. This indicates no statistically significant difference in the central tendency (median) of these variables between successful and cancelled orders. Visualizations also confirmed these findings, showing overlapping distributions for both groups.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


library(ggpubr)
set.seed(123)

age_bx <- ggplot(order_dat) + 
  geom_boxplot(aes(x=as.factor(is_cancel), y=age)) +
  theme_bw()

item_bx <- ggplot(order_dat) + 
  geom_boxplot(aes(x=as.factor(is_cancel), y=total_item)) +
  theme_bw()

amount_bx <- ggplot(order_dat) + 
  geom_boxplot(aes(x=as.factor(is_cancel), y=total_amount)) +
  theme_bw()

figure <- ggarrange(age_bx, item_bx, amount_bx,
                    common.legend = TRUE, legend="bottom", ncol = 3, nrow = 1)

annotate_figure(figure, top = text_grob("Order Status By Age, Total Item and Total Amount", 
               color = "red", face = "bold", size = 12))

Visualizations also confirmed these findings, showing overlapping distributions for both groups. across age, number of items, or total purchase amount.

Age of users has no signficant correlation towards order status
Number of item of purchase has no signficant correlation towards order status
Total amount of purchase has no signficant correlation towards order status

Consequently, these numerical variables do not appear to significantly influence order cancellation rates.

Analyse categorical variables vs order status

Next, we’ll examine how categorical variables such as product department, age group, traffic source, and delivery destination relate to order cancellations. To determine if there’s a significant association between these variables and order status, we’ll conduct chi-square tests.

H0 : There’s no association between categorical variables and order status

H1 : There’s association between categorical variables and order status

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30


# Categorize age group into categorical variables
order_dat <- order_dat %>% 
  mutate(
    # Create categories
    age_group = dplyr::case_when(
      age <= 17            ~ "0-17",
      age > 17 & age <= 25 ~ "18-25",
      age > 25 & age <= 34 ~ "26-45",
      age > 45 & age <= 55  ~ "46- 54",
      age > 54  ~ ">55"
    )
)
order_dat$age_group = as.factor(order_dat$age_group)

# Function to perform Chi-Square test for a given variable
chi_square_test <- function(variable) {
  table <- table(order_dat[[variable]], order_dat$is_cancel)
  chisq <- chisq.test(table)
  return(chisq)
}

# List of categorical variables
categorical_vars <- c("product_department", "gender", "traffic_source", "deliver_to", "age_group")

# Apply Chi-Square test to each categorical variable
chi_square_results <- lapply(categorical_vars, chi_square_test)

# Display the results
names(chi_square_results) <- categorical_vars
chi_square_results

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38


## $product_department
## 
## 	Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 1.705, df = 1, p-value = 0.1916
## 
## 
## $gender
## 
## 	Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 1.705, df = 1, p-value = 0.1916
## 
## 
## $traffic_source
## 
## 	Pearson's Chi-squared test
## 
## data:  table
## X-squared = 4.6306, df = 4, p-value = 0.3273
## 
## 
## $deliver_to
## 
## 	Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 0.18457, df = 1, p-value = 0.6675
## 
## 
## $age_group
## 
## 	Pearson's Chi-squared test
## 
## data:  table
## X-squared = 3.2614, df = 4, p-value = 0.5151

Based on the chi-square test results, here’s a the conclusion:

Product Department and Gender: Both tests show identical results (p-value = 0.1916). This suggests no significant association between order status and product department or gender. The similarity in results might indicate these variables are closely related or potentially the same in our dataset.
Traffic Source: With p-value = 0.3273, there’s no significant association between traffic source and order status. The different sources (organic, email, social, etc.) don’t appear to influence cancellation rates significantly.
Delivery Destination: The test (p-value = 0.6675) indicates no significant relationship between delivery destination (export or domestic) and order status. Whether an order is domestic (US) or international doesn’t seem to affect cancellation rates.
Age Group: With p-value = 0.5151, there’s no significant association between age groups and order status. This suggests that cancellation rates don’t vary significantly across different age categories.

In summary, none of the tested variables show a statistically significant association with order status at the conventional 0.05 significance level. This implies that these factors alone may not be strong predictors of order cancellations in our e-commerce dataset.

Findings and Recommendations

The lack of significant correlations suggests that factors beyond the analyzed variables contribute to cancellations. Potential areas for further investigation include:

Product-related issues: Product quality, descriptions, or images might influence cancellations.
Customer service: Issues during the checkout process or post-purchase support could impact customer satisfaction.
Website experience: Website performance, navigation, or mobile optimization might affect the purchase journey.
Pricing and promotions: Price perception or promotional factors could influence purchase decisions.
By focusing on these areas, The Look E-commerce can potentially reduce cancellation rates and improve customer satisfaction.

Conclusion

While this analysis identified no direct correlations between the studied variables and order cancellations, it highlights the need for further exploration into other potential factors. Continuous monitoring of cancellation rates and in-depth analysis of specific customer segments are essential for effective retention strategies.