Experiment Name: ML-Powered Restaurant Recommendations V1
Owner: Ayush Saxena
Date: January 2026
Status: Ready for Launch
H1: Users with ML-powered personalized recommendations will have 40% lower time-to-order compared to users with the current "Sort by: Distance/Rating" approach.
Null Hypothesis (H0): There is no significant difference in time-to-order between treatment and control groups.
Alternative Hypothesis (H1): Treatment group has significantly lower time-to-order (one-tailed test).
H2: Treatment group will have a 15% higher order conversion rate.
H3: Treatment group will order from 2+ new restaurants per month.
H4: Treatment group will have higher user satisfaction scores.
| Metric | Baseline | Target | Minimum Detectable Effect |
|---|---|---|---|
| Time to Order | 10 minutes | 6 minutes | 2 minutes (20% reduction) |
Measurement:

```python
time_to_order = order_placed_timestamp - home_page_load_timestamp
# Only count sessions where an order was placed
# Exclude sessions longer than 30 minutes (outliers)
```

| Metric | Baseline | Target | Measurement |
|---|---|---|---|
| Order Conversion Rate | 12% | 13.8% (+15%) | Orders / Sessions |
| Discovery Rate | 1.2/month | 2.0/month | New restaurants ordered |
| Repeat Order Rate | 65% | 55% (-10pp) | Orders from previous restaurants |
| User Satisfaction | 4.1/5 | 4.3/5 | Post-order survey rating |
| Metric | Baseline | Threshold | Action if Breached |
|---|---|---|---|
| Avg Delivery Time | 35 min | 38 min | Stop test |
| Cancellation Rate | 5% | 6% | Investigate |
| Negative Feedback | 8% | 10% | Investigate |
| Revenue per Order | ₹380 | ₹350 | Monitor |
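The guardrail table above can be enforced mechanically during daily reviews. A minimal sketch, assuming the thresholds from the table; the `check_guardrails` helper, metric keys, and action labels are illustrative, not part of the production system:

```python
# Guardrail thresholds from the table above; "direction" marks whether a
# breach means the observed value rose above or fell below the threshold.
GUARDRAILS = {
    "avg_delivery_time_min": {"threshold": 38, "direction": "above", "action": "Stop test"},
    "cancellation_rate": {"threshold": 0.06, "direction": "above", "action": "Investigate"},
    "negative_feedback_rate": {"threshold": 0.10, "direction": "above", "action": "Investigate"},
    "revenue_per_order_inr": {"threshold": 350, "direction": "below", "action": "Monitor"},
}

def check_guardrails(observed: dict) -> list:
    """Return (metric, action) pairs for every breached guardrail."""
    breaches = []
    for metric, rule in GUARDRAILS.items():
        value = observed[metric]
        if rule["direction"] == "above":
            breached = value > rule["threshold"]
        else:
            breached = value < rule["threshold"]
        if breached:
            breaches.append((metric, rule["action"]))
    return breaches

# Example: delivery time has crept past its threshold
observed = {
    "avg_delivery_time_min": 38.5,
    "cancellation_rate": 0.051,
    "negative_feedback_rate": 0.08,
    "revenue_per_order_inr": 372,
}
print(check_guardrails(observed))  # [('avg_delivery_time_min', 'Stop test')]
```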
Control Group (50%):
- Current experience: Sort by Distance/Rating/Delivery Time
- No personalization
- Generic restaurant listing
Treatment Group (50%):
- ML-powered personalized recommendations
- Top-10 highlighted section at top of feed
- Explanations for each recommendation
- Context-aware (time, weather, location)
Unit of Randomization: User ID
Randomization Method:

```python
import hashlib

def assign_group(user_id: str) -> str:
    # Deterministic hash-based assignment
    hash_value = hashlib.md5(user_id.encode()).hexdigest()
    hash_int = int(hash_value, 16)
    if hash_int % 2 == 0:
        return "control"
    else:
        return "treatment"
```

Why Hash-Based?
- Deterministic: the same user always gets the same experience
- Balanced: ~50/50 split
- No bias: independent of user attributes
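A quick way to sanity-check the claimed properties (determinism and an approximately even split) is to run the assignment over simulated user IDs. The sketch below restates the hash rule from this section; the simulated ID format is arbitrary:

```python
import hashlib

def assign_group(user_id: str) -> str:
    # Same deterministic hash rule as above
    hash_int = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "control" if hash_int % 2 == 0 else "treatment"

users = [f"user_{i}" for i in range(10_000)]
groups = [assign_group(u) for u in users]

# Deterministic: repeating the assignment gives identical results
assert groups == [assign_group(u) for u in users]

# Balanced: roughly 50/50, since MD5 output parity is effectively uniform
control_share = groups.count("control") / len(groups)
print(f"control share: {control_share:.3f}")
```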
Parameters:
- Baseline time-to-order: 10 minutes (σ = 4 minutes)
- Minimum detectable effect: 2 minutes (20% reduction)
- Significance level (α): 0.05
- Statistical power (1-β): 0.80
Calculation:

```python
# Effect size (Cohen's d): MDE over the standard deviation
effect_size = 2 / 4  # 0.5

# Sample size per group: n = 2 * (z_alpha + z_beta)^2 / d^2
n_per_group = (2 * (1.96 + 0.84) ** 2) / (effect_size ** 2)  # ≈ 63, round up to 64

# Safety margin (1.5×)
n_per_group_safe = 64 * 1.5  # 96

# Total users needed
total_users = 96 * 2  # 192
```

Decision: Test with 10,000 users per group (safety margin for multiple metrics).
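The same arithmetic can be reproduced with stdlib normal quantiles instead of the hard-coded 1.96 and 0.84. A sketch using `statistics.NormalDist`; note it uses the two-sided z-value as this section does, which is conservative for the one-tailed test planned here:

```python
import math
from statistics import NormalDist

alpha, power = 0.05, 0.80
d = 2 / 4  # Cohen's d: 2-minute MDE over sigma = 4 minutes

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 (two-sided, conservative)
z_beta = NormalDist().inv_cdf(power)           # ≈ 0.84

# n = 2 * (z_alpha + z_beta)^2 / d^2, rounded up to a whole user
n_per_group = math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)
print(n_per_group)  # 63 (the document rounds this up to 64)
```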
Duration Calculation:
- Daily active users: 5,000
- Users per group per day: 2,500 (50%)
- Required sample: 10,000 per group
- Days needed: 10,000 / 2,500 = 4 days
Actual Duration: 14 days
Why 14 days?
- Capture full week (weekday + weekend behavior)
- Account for day-of-week effects
- Allow for metric stabilization
- Detect potential novelty effects
Events to Track:

```python
# Page load
track_event("home_page_loaded", {
    "user_id": user_id,
    "timestamp": timestamp,
    "test_group": "treatment",  # or "control"
    "context": {
        "time_of_day": "dinner",
        "weather": "clear",
        "location": (lat, lon),
    },
})

# Recommendation displayed
track_event("recommendations_shown", {
    "user_id": user_id,
    "timestamp": timestamp,
    "restaurant_ids": [list_of_10],
    "model_scores": [list_of_scores],
})

# Restaurant clicked
track_event("restaurant_clicked", {
    "user_id": user_id,
    "timestamp": timestamp,
    "restaurant_id": restaurant_id,
    "rank": 3,  # Position in list
    "explanation_viewed": True,
})

# Order placed
track_event("order_placed", {
    "user_id": user_id,
    "timestamp": timestamp,
    "restaurant_id": restaurant_id,
    "order_value": 450,
    "from_recommendation": True,
    "recommendation_rank": 3,
})
```
```python
# Feature flag configuration
feature_flags = {
    "ml_recommendations": {
        "enabled": True,
        "rollout_percentage": 50,
        "whitelist_users": [],  # VIP users forced into treatment
        "blacklist_users": [],  # Exclude users if issues arise
    }
}

# Usage
if is_enabled("ml_recommendations", user_id):
    recommendations = ml_model.recommend(user_id)
else:
    recommendations = default_sort_by_rating()
```

Phase 1: Internal (Week 1)
- 100 internal employees
- Goal: Catch critical bugs
- Success criteria: No crashes, latency <2s
Phase 2: Alpha (Week 2)
- 1% of users (5,000)
- Goal: Validate instrumentation
- Success criteria: Data logging works
Phase 3: Beta (Week 3-4)
- 10% of users (50,000)
- Goal: Detect any adverse effects
- Success criteria: Guardrails not breached
Phase 4: Full A/B Test (Week 5-6)
- 50% of users (treatment group)
- Goal: Measure impact on key metrics
- Success criteria: Primary metric improved
Phase 5: Full Launch (Week 7+)
- 100% of users
- Goal: Production deployment
- Success criteria: Sustained improvement
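The phased percentages above can drive the `is_enabled` check used in the feature-flag snippet, which this document does not define. One plausible sketch ties the rollout percentage to the same hash bucketing used for group assignment; the flag-name salting and config shape are assumptions, not the production implementation:

```python
import hashlib

feature_flags = {
    "ml_recommendations": {
        "enabled": True,
        "rollout_percentage": 50,  # raise per phase: 1 -> 10 -> 50 -> 100
        "whitelist_users": [],     # always get the treatment
        "blacklist_users": [],     # never get the treatment
    }
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    flag = feature_flags[flag_name]
    if not flag["enabled"] or user_id in flag["blacklist_users"]:
        return False
    if user_id in flag["whitelist_users"]:
        return True
    # Salt with the flag name so different flags bucket users independently
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percentage"]

print(is_enabled("ml_recommendations", "user_123"))
```

Salting the hash with the flag name means a user's bucket for this experiment is independent of their bucket in any other experiment, which avoids correlated exposure across tests.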
Primary Metric (Time to Order):
- Test: Two-sample t-test (one-tailed)
- Significance level: α = 0.05
- Effect direction: Treatment < Control
```python
from scipy.stats import ttest_ind

control_times = df[df['group'] == 'control']['time_to_order']
treatment_times = df[df['group'] == 'treatment']['time_to_order']

# One-tailed: H1 is that the treatment mean is lower than the control mean,
# so the treatment sample goes first with alternative='less'
t_stat, p_value = ttest_ind(treatment_times, control_times, alternative='less')

if p_value < 0.05 and treatment_times.mean() < control_times.mean():
    conclusion = "Reject null hypothesis: Treatment is significantly better"
else:
    conclusion = "Fail to reject null hypothesis"
```

Secondary Metrics:
- Conversion rate: Two-proportion z-test
- Discovery rate: Two-sample t-test
- Satisfaction: Mann-Whitney U test (ordinal data)
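The two-proportion z-test named above for conversion rate can be written with stdlib tools (scipy/statsmodels offer equivalents). The pooled-variance formula below is the standard textbook version; the session and order counts are illustrative, not test results:

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """One-tailed test of H1: p1 > p2. Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under H0: both groups share one conversion rate
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# Illustrative counts: treatment converts 13.8% of 10,000 sessions vs 12.0% control
z, p = two_proportion_z_test(1380, 10_000, 1200, 10_000)
print(f"z = {z:.2f}, p = {p:.5f}")
```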
Problem: Testing multiple metrics increases the false-positive risk.
Solution: Bonferroni correction.

```python
# 1 primary + 4 secondary = 5 tests
corrected_alpha = 0.05 / 5  # 0.01
# Use α = 0.01 as the significance threshold for each test
```

Subgroup Analysis:
By User Segment:
- New users (<3 orders)
- Regular users (3-20 orders)
- Power users (20+ orders)

By Time of Day:
- Breakfast, Lunch, Dinner, Late night

By City:
- If multi-city launch
```python
# Example: Segment analysis
for segment in ['new_users', 'regular_users', 'power_users']:
    segment_data = df[df['user_segment'] == segment]
    control = segment_data[segment_data['group'] == 'control']
    treatment = segment_data[segment_data['group'] == 'treatment']
    # Positive value = treatment is faster
    reduction = control['time_to_order'].mean() - treatment['time_to_order'].mean()
    print(f"{segment}: {reduction:.2f} minutes reduction")
```

Check for:
- Novelty Effect: Does the effect diminish over time?

```python
# Compare Week 1 vs Week 2
week1_effect = calculate_effect(df[df['week'] == 1])
week2_effect = calculate_effect(df[df['week'] == 2])
```

- Simpson's Paradox: Is the effect consistent across segments?
- Outliers: Remove the top/bottom 1% and re-test
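The outlier re-test can be sketched without pandas: trim the top and bottom 1% of time-to-order values before recomputing the group difference. The `trim_1pct` helper and the synthetic session times below are illustrative:

```python
def trim_1pct(values):
    """Drop the bottom 1% and top 1% of observations."""
    s = sorted(values)
    k = max(1, len(s) // 100)
    return s[k:-k]

# Synthetic time-to-order samples (minutes); one pathological control session
control = [9.5, 10.0, 10.2, 10.4, 11.0] * 40 + [120.0]
treatment = [7.5, 7.8, 8.0, 8.2, 8.4] * 40

raw_effect = sum(control) / len(control) - sum(treatment) / len(treatment)
trimmed_control = trim_1pct(control)
trimmed_effect = (sum(trimmed_control) / len(trimmed_control)
                  - sum(treatment) / len(treatment))
print(f"raw: {raw_effect:.2f} min, trimmed: {trimmed_effect:.2f} min")
```

If the trimmed effect is materially smaller than the raw effect, a handful of extreme sessions is inflating the result and the raw estimate should not be trusted on its own.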
| Condition | Primary Metric | Guardrails | Decision |
|---|---|---|---|
| ✅ Success | Improved ≥20% | Not breached | LAUNCH |
| ✅ Partial success | Improved 10-20% | Not breached | Launch with monitoring |
| ➖ Neutral | Improved <10% | Not breached | Do not launch |
| ❌ Failure | Any | Breached | Stop immediately |
✅ Statistical Significance: p < 0.05 on the primary metric
✅ Practical Significance: ≥20% improvement in time-to-order
✅ Guardrail Metrics: None breached
✅ User Feedback: Net Promoter Score ≥ baseline
✅ Technical Stability: Latency <2s, error rate <1%
Automatic Rollback If:
- Error rate >5% for 15 minutes
- Latency p95 >5 seconds for 30 minutes
- Order cancellation rate >10%
Manual Rollback If:
- Guardrail metrics breached
- Significant negative user feedback
- Business stakeholder concern
Metrics to Monitor:
```
Restaurant Recommendations A/B Test - Day 7/14

PRIMARY METRIC
  Time to Order
    Control:   10.2 min (n=8,523)
    Treatment:  7.8 min (n=8,491)
    Δ: -2.4 min (-23.5%)
    p-value: 0.003 ✅

SECONDARY METRICS
  Conversion Rate: 12.1% → 13.8% (+14%) ✅
  Discovery Rate:  1.3 → 1.8 (+38%) ✅

GUARDRAILS
  Delivery Time: 35.2 → 36.1 min (+0.9) ✅
  Cancellation:  5.1% → 5.3% (+0.2pp) ✅

TECHNICAL METRICS
  Latency (p95): 1.8s ✅
  Error Rate:    0.3% ✅
```
Executive Summary (1 page):
- Test outcome: Success/Failure
- Primary metric result
- Recommendation: Launch/Don't launch
- Expected business impact
Detailed Analysis (5-10 pages):
- Experiment setup
- Sample characteristics
- Statistical results (all metrics)
- Subgroup analysis
- Qualitative feedback
- Technical performance
- Risks and mitigations
- Next steps
| Risk | Impact | Mitigation |
|---|---|---|
| Model latency >2s | High | Pre-computation, caching |
| Model failures | High | Fallback to popular restaurants |
| Data pipeline issues | Medium | Monitoring, alerts |
| Incorrect test assignment | High | Validation checks, unit tests |
| Risk | Impact | Mitigation |
|---|---|---|
| Restaurant complaints (unequal visibility) | Medium | Fair distribution monitoring |
| User backlash (privacy concerns) | High | Clear communication, opt-out |
| Revenue impact (lower AOV) | Medium | Track closely, set guardrails |
| Competitor copying | Low | Move fast, iterate |
| Risk | Impact | Mitigation |
|---|---|---|
| Selection bias | High | Randomization checks |
| Novelty effect | Medium | Extended test duration |
| Network effects | Low | User-level randomization |
| Simpson's paradox | Medium | Subgroup analysis |
Week 1: Intensive Monitoring
- Daily dashboard review
- Real-time alerts
- User feedback collection
Week 2-4: Regular Monitoring
- Weekly metric reviews
- Bi-weekly stakeholder updates
- Continuous model performance tracking
Quick Wins (Month 1-2):
- Tune model weights based on feedback
- Improve explanations clarity
- Fix edge cases discovered in production
Medium-term (Month 3-6):
- Add user feedback loop ("Not interested")
- Implement multi-armed bandit
- Real-time availability filtering
Long-term (6+ months):
- Deep learning models
- Multi-objective optimization
- Multi-city expansion
Audience: Engineering, Product, Business, Marketing
Message:
- What: ML-powered restaurant recommendations
- Why: Reduce decision fatigue, improve discovery
- When: 2-week A/B test starting [date]
- Impact: Potentially 40% faster ordering
Daily: Email summary to core team
Weekly: Presentation to leadership
Ad-hoc: Slack updates on significant changes
Success Announcement:
- Internal all-hands presentation
- External blog post (if appropriate)
- Case study for portfolio
```sql
-- Time to order for each user session
SELECT
    user_id,
    test_group,
    session_id,
    TIMESTAMPDIFF(SECOND,
        home_page_load_time,
        order_placed_time
    ) / 60.0 AS time_to_order_minutes
FROM sessions
WHERE order_placed_time IS NOT NULL
    AND order_placed_time BETWEEN test_start_date AND test_end_date
    -- Remove outliers (the alias cannot be referenced in WHERE,
    -- so the expression is repeated)
    AND TIMESTAMPDIFF(SECOND, home_page_load_time, order_placed_time) / 60.0 < 30;

-- Conversion rate
SELECT
    test_group,
    COUNT(DISTINCT CASE WHEN order_placed THEN session_id END) /
        COUNT(DISTINCT session_id) AS conversion_rate
FROM sessions
WHERE session_start_time BETWEEN test_start_date AND test_end_date
GROUP BY test_group;
```

Document Owner: Ayush Saxena
Review Date: Pre-Launch
Status: Approved for Testing