Data Science Challenge for Wolt
By: Agatha Priandini
Summary
The purpose of this notebook is to present key metrics for Wolt’s Consumer, Merchant and Courier & Logistics domains.
Table of Contents
- Section 1. Pre-analysis
- Section 2. Univariate Analysis
- Section 3. Multivariate Analysis
- Section 4. RFM, Relapse and CLV Analysis
- Conclusion
- Limitations
Load libraries and data
# -- Load libraries and data
library(tidyverse)
library(dplyr)
library(ggplot2)
library(jsonlite)
library(janitor)
library(lubridate)
data <- read.csv("/Users/agathapriandini/Desktop/dataset_for_datascience_assignment.csv")
Section 1: PRE-ANALYSIS
Goal: build basic understanding of the dataset by exploring anomalies, relationships, and descriptive patterns to form research questions.
Research objectives
- Understand what this dataset is about.
- Investigate the quality of this dataset and clean it if required.
Initial glimpse() showed rich purchase data (activity, registration, location). Many NA values appeared in purchase-related variables, so numeric NAs were set to 0, and character NAs to "Unknown". Types were adjusted where appropriate. LATE_NIGHT_PURCHASES was dropped due to missing values.
Summary of fixes
- Replace NA with 0 in AVG_PURCHASE_VALUE_EUR, MAX_PURCHASE_VALUE_EUR, MIN_PURCHASE_VALUE_EUR.
- Convert USER_ID to character.
- Replace NA with "Unknown" in USER_HAS_VALID_PAYMENT_METHOD, REGISTRATION_COUNTRY, PREFERRED_DEVICE, MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE, MOST_COMMON_WEEKDAY_TO_PURCHASE.
- Drop LATE_NIGHT_PURCHASES.
- Move USER_ID to the first column.
glimpse(data)
# General data-type cleaning
data <- data %>%
mutate(
AVG_PURCHASE_VALUE_EUR = replace_na(AVG_PURCHASE_VALUE_EUR, 0),
MAX_PURCHASE_VALUE_EUR = replace_na(MAX_PURCHASE_VALUE_EUR, 0),
MIN_PURCHASE_VALUE_EUR = replace_na(MIN_PURCHASE_VALUE_EUR, 0),
USER_ID = as.character(USER_ID),
USER_HAS_VALID_PAYMENT_METHOD = replace_na(as.character(USER_HAS_VALID_PAYMENT_METHOD), "Unknown"),
REGISTRATION_COUNTRY = replace_na(as.character(REGISTRATION_COUNTRY), "Unknown"),
PREFERRED_DEVICE = replace_na(as.character(PREFERRED_DEVICE), "Unknown"),
MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE = replace_na(as.character(MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE), "Unknown"),
MOST_COMMON_WEEKDAY_TO_PURCHASE = replace_na(as.character(MOST_COMMON_WEEKDAY_TO_PURCHASE), "Unknown")
) %>%
relocate(USER_ID, .before = everything()) %>%
select(-LATE_NIGHT_PURCHASES)
Sample glimpse output
## Rows: 21,983
## Columns: 30
## $ REGISTRATION_DATE <chr> "2019-09-01 00:00:00.000", ...
## $ REGISTRATION_COUNTRY <chr> "DNK", "FIN", "DNK", ...
## $ PURCHASE_COUNT <int> 0, 1, 19, 0, ...
## ... (truncated for brevity)
Section 2: UNIVARIATE ANALYSIS
Objectives:
- Find numerical outliers.
- Determine variable frequencies.
- Check for clusters.
Users with no purchases were removed by keeping only rows with MIN_PURCHASE_VALUE_EUR > 0 (their purchase values had been imputed to 0 above), to stabilize the analyses.
Part A: Numerical outliers (IQR)
Outliers were identified with Q1, Q3, and 1.5×IQR bounds. Highlights:
- WEB_PURCHASES: 8 outliers beyond [-1.5, 2.5]
- TOTAL_PURCHASES_EUR: 44 outliers beyond [-213.53, 442.24]
- Behavioral metrics show notable outliers (e.g., AVG_DAYS_BETWEEN_PURCHASES, AVG_PURCHASE_VALUE_EUR)
sum(is.na(data))
data <- data %>%
filter(MIN_PURCHASE_VALUE_EUR > 0)
summary(data)
# IQR outliers: flag values beyond [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
num_outlier_summary <- data %>%
  select(where(is.numeric)) %>%
  map_dfr(~{
    q1 <- quantile(.x, 0.25, na.rm = TRUE)
    q3 <- quantile(.x, 0.75, na.rm = TRUE)
    iqr <- q3 - q1
    lower <- q1 - 1.5 * iqr
    upper <- q3 + 1.5 * iqr
    tibble(
      lower_bound = lower,
      upper_bound = upper,
      outlier_count = sum(.x < lower | .x > upper, na.rm = TRUE)
    )
  }, .id = "variable")  # .id captures each column name; deparse(substitute(.x)) would not
print(num_outlier_summary)
Part B: Variable frequencies
Numeric summaries and categorical distributions to surface clusters/outliers.
- Many users make few purchases; a minority make many.
- 95%+ of users are from FIN, DNK, or GRC.
- ~40% of paying users have invalid payment methods (likely expired cards).
# Numerical variables frequency analysis
numeric_summary <- data %>%
select(where(is.numeric)) %>%
map_dfr(
.f = ~ {
x <- .x
x <- x[x != 0]  # drop imputed zeros so they do not skew the summaries
tibble(
mean = mean(x, na.rm = TRUE),
median = median(x, na.rm = TRUE),
sd = sd(x, na.rm = TRUE),
min = min(x, na.rm = TRUE),
max = max(x, na.rm = TRUE),
n = length(x)
)
},
.id = "variable"
)
print(numeric_summary)
# Character variables analysis (freq table + plots)
# ... (code omitted for brevity)
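The categorical code is omitted above; a minimal sketch of what the frequency analysis could look like, assuming janitor::tabyl for the tables (janitor is loaded in the setup chunk) — an illustrative reconstruction, not the original code:
# Sketch: frequency table for one character variable
data %>%
  tabyl(REGISTRATION_COUNTRY) %>%
  arrange(desc(n))

# Sketch: bar plot of device shares
data %>%
  count(PREFERRED_DEVICE, sort = TRUE) %>%
  mutate(share = n / sum(n)) %>%
  ggplot(aes(x = reorder(PREFERRED_DEVICE, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Preferred device", y = "Users")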
Part C: Initial clustering via value tiers
Quartiles on purchase value metrics → Tier 1–4 segmentation.
segment_quartiles <- function(x) {
  cut(
    x,
    breaks = quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE),
    include.lowest = TRUE,
    labels = c("Q1", "Q2", "Q3", "Q4")
  )
}

purchase_vars <- c("AVG_PURCHASE_VALUE_EUR", "MIN_PURCHASE_VALUE_EUR", "MAX_PURCHASE_VALUE_EUR")

segmented_data <- data %>%
  mutate(across(all_of(purchase_vars), ~ segment_quartiles(.x), .names = "{.col}_segmented")) %>%
  mutate(
    avg_score = rowMeans(across(ends_with("_segmented"), ~ as.numeric(.x)), na.rm = TRUE),
    user_tier = case_when(
      avg_score <= 1.5 ~ "Tier 1 (Low)",
      avg_score <= 2.5 ~ "Tier 2 (Moderate)",
      avg_score <= 3.5 ~ "Tier 3 (High)",
      TRUE ~ "Tier 4 (Top)"
    )
  )
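As a quick sanity check on the segmentation, the tier sizes can be tabulated (a small usage sketch):
# Sketch: distribution of users across the four tiers
segmented_data %>%
  count(user_tier) %>%
  mutate(share = n / sum(n))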
Section 3: MULTIVARIATE ANALYSIS
Views explored: Num–Num (correlation heatmaps), Cat–Num (box/violin), Cat–Cat (contingency).
Preprocessing
- Collapse REGISTRATION_COUNTRY values outside the top 3 to "Other".
- Parse cuisine and store-type strings into counts.
# Example: top countries, collapse others
top_countries <- data %>%
  count(REGISTRATION_COUNTRY, sort = TRUE) %>%
  slice_max(n, n = 3) %>%
  pull(REGISTRATION_COUNTRY)

data <- data %>%
  mutate(REGISTRATION_COUNTRY = if_else(
    REGISTRATION_COUNTRY %in% top_countries,
    REGISTRATION_COUNTRY,
    "Other"
  ))
Part A: Num vs Num (correlation heatmap)
cor_matrix <- data %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs")
# ... reshape + ggplot heatmap
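The reshape and plot step is omitted above; one plausible sketch follows, assuming a long-format geom_tile heatmap (a common choice, not the notebook's confirmed approach):
# Sketch: melt the correlation matrix and draw a tile heatmap
cor_long <- cor_matrix %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "correlation")

ggplot(cor_long, aes(x = var1, y = var2, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = NULL, y = NULL, fill = "r")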
Key patterns:
- Purchase cluster: PURCHASE_COUNT, TOTAL_PURCHASES_EUR, PURCHASE_COUNT_DELIVERY, AVG/MAX_PURCHASE_VALUE_EUR, DISTINCT_PURCHASE_VENUE_COUNT.
- Recency: AVG_DAYS_BETWEEN_PURCHASES negatively correlates with spend/frequency.
- Platforms: iOS/Android/Web purchases mildly negatively correlated (single-platform preference).
- Cuisines & store types: small but meaningful co-variation; hints at functional segments.
Part B: Cat vs Num (box/violin)
- DNK shows highest average spending with many high-value outliers; FIN mid; GRC lowest.
- Spending fairly stable by hour/weekday with slight evening and lunch spikes.
- Web users exhibit occasional very high orders (>€600); iOS > Android on average.
- Valid payment method associates with more orders; outliers suggest edge cases.
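The plots behind these observations are not shown; a minimal sketch of the country-vs-spend view (assuming AVG_PURCHASE_VALUE_EUR as the spend measure):
# Sketch: spending distribution by registration country (violin + box)
ggplot(data, aes(x = REGISTRATION_COUNTRY, y = AVG_PURCHASE_VALUE_EUR)) +
  geom_violin(fill = "grey90") +
  geom_boxplot(width = 0.15, outlier.alpha = 0.3) +
  scale_y_log10() +  # compress the long right tail from high-value outliers
  labs(x = "Registration country", y = "Avg purchase value (EUR, log scale)")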
Part C: Cat vs Cat
- DNK dominated by iOS; FIN & GRC lean Android; DNK shows notable Web usage.
- Weekday/hour profiles are broadly uniform; focus on longer horizons for segmentation.
- Android has more “no valid payment method” cases than Web.
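A contingency-table sketch for the first observation, assuming janitor's tabyl/adorn helpers (the notebook's actual code is not shown):
# Sketch: country vs preferred device, row-wise percentages
data %>%
  tabyl(REGISTRATION_COUNTRY, PREFERRED_DEVICE) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1)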
Section 4: RFM, Relapse and CLV Analysis
Part A: RFM
Quartile-based RFM scoring; segments: Champions, Loyal, Potential, Low-value.
- Champions (16.5%): highest value, consistent returns; retention priority.
- Loyal (37.0%): strong revenue, consistent returns; engagement opportunity.
- Potential (10.3%): lower spend; nurture to grow.
- Low-value (36.2%): minimal engagement; consider onboarding nudges.
rfm_data <- data %>%
  transmute(
    USER_ID,
    recency = -AVG_DAYS_BETWEEN_PURCHASES,  # negated so shorter gaps score higher
    frequency = PURCHASE_COUNT,
    monetary = AVG_PURCHASE_VALUE_EUR
  )
rfm_scores <- rfm_data %>% mutate(
r_score = ntile(recency, 4), f_score = ntile(frequency, 4), m_score = ntile(monetary, 4),
rfm_total = r_score + f_score + m_score
) %>% mutate(
rfm_segment = case_when(rfm_total >= 10 ~ "Champions", rfm_total >= 7 ~ "Loyal", rfm_total >= 5 ~ "Potential", TRUE ~ "Low-value")
)
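A quick usage check on the scoring (should roughly reproduce the segment shares quoted above):
# Sketch: segment counts and shares
rfm_scores %>%
  count(rfm_segment) %>%
  mutate(share = n / sum(n))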
Part B: Relapse
- Tier 4 relapses too (long gaps, high value on return) — target re-engagement (e.g., GRC).
- Tier 3 has the most relapsed users; value differentiates more than frequency.
- Relapsed users often return for lunch, Thu ~18:00, mostly on iOS — a niche segment to target.
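The relapse logic itself is not shown in this summary; a minimal sketch of one way to flag relapse-like users from the columns used elsewhere in this notebook (the 60-day gap threshold is illustrative, not the notebook's actual rule):
# Sketch: repeat purchasers with unusually long average gaps, profiled by
# weekday/hour/device. The 60-day cutoff is an illustrative assumption.
relapse_profile <- data %>%
  filter(PURCHASE_COUNT >= 2, AVG_DAYS_BETWEEN_PURCHASES > 60) %>%
  count(MOST_COMMON_WEEKDAY_TO_PURCHASE,
        MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE,
        PREFERRED_DEVICE,
        sort = TRUE)
head(relapse_profile)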
Part C: CLV
CLV ≈ (ARPU × Gross Margin) / Churn. A 2022 gross-profit figure was used as the margin proxy, and churn was defined as no purchase within 90 days as of late 2020.
- Tier 3: strongest CLV (balance of spend & retention).
- Tier 4: highest spend but faster churn → lower long-term value.
- Tier 2: growth path to Tier 3; Tier 1 low ROI without strong signals.
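Applying the formula per tier might look like the sketch below; gross_margin is an illustrative stand-in for the 2022 proxy, and DAYS_SINCE_LAST_PURCHASE is a hypothetical column name for the last-purchase field mentioned in the Limitations:
# Sketch: CLV ≈ (ARPU × gross margin) / churn, per tier.
# gross_margin is illustrative; DAYS_SINCE_LAST_PURCHASE is a hypothetical
# stand-in for the notebook's last-purchase field.
gross_margin <- 0.25

clv_by_tier <- segmented_data %>%
  group_by(user_tier) %>%
  summarise(
    arpu = mean(TOTAL_PURCHASES_EUR, na.rm = TRUE),
    churn_rate = mean(DAYS_SINCE_LAST_PURCHASE > 90, na.rm = TRUE),  # 90-day rule
    clv = (arpu * gross_margin) / churn_rate
  )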
Conclusion
- Relapse behavior: predictable windows (weekday 4, hour 18), iOS, lunch focus; valuable niche segment.
- Tier segmentation: Tier 3 most valuable; Tier 4 high-spend but churn-prone.
- CLV: Tier 3 > Tier 2 > Tier 4 > Tier 1, under current assumptions.
- Churn: elevated in Tier 1 & Tier 4; targeted engagement likely to help.
Recommendations
- Double down on Tier 3 (loyalty, referrals, personalization).
- Re-engage relapsed Tier 2–3 users at their top weekday/hour slots.
- Refine CLV with fresher data and survival models.
- Limit spend on Tier 1 unless signals suggest upsell potential.
Limitations
- Estimated Gross Margin: 2022 proxy used on older data could overstate profitability.
- Outdated Purchase Data: last-purchase fields skew churn estimates.
- Churn Definition: 90-day rule is a proxy; may miss seasonality.
- Relapse Classification: based on inactivity + repeat purchase; promo effects not isolated.
- Missing Cost Data: CLV excludes CAC/marketing; net value unknown.
- Inferred Behaviors: device/hour fields treated as behavioral; generation process unknown.
- Heuristic Segmentation: simple quantiles; may miss deeper patterns.