Author: Ayush Saxena
Date: January 2026
Version: 1.0
- Problem Formulation
- Data Strategy
- Feature Engineering
- Model Selection
- Evaluation Framework
- Hyperparameter Tuning
- Production Considerations
Classification: Recommendation System (Ranking Problem)
Input:
- User ID
- User profile (preferences, history)
- Contextual information (time, location, weather)
Output:
- Ranked list of top-N restaurants
- Probability/score for each restaurant
- Explanation for each recommendation
Objective Function:
Maximize: User satisfaction (orders placed from recommendations)
Subject to:
- Diversity constraint (cuisine variety)
- Latency constraint (<2 seconds)
- Fairness constraint (restaurant visibility)
Business Metrics:
- Primary: Time-to-order reduction (40%)
- Secondary: Conversion rate (+15%)
- Tertiary: Discovery rate (2+ new restaurants/month)
ML Metrics:
- Precision@10 โฅ 0.08
- Hit Rate@10 โฅ 0.40
- NDCG@10 โฅ 0.25
- Diversity โฅ 0.60
Baseline 1: Random
- Randomly sample 10 restaurants
- Hit Rate@10: ~0.05
- Purpose: Establish lower bound
Baseline 2: Most Popular
- Sort by (popularity ร rating)
- Hit Rate@10: ~0.32
- Purpose: Simple non-personalized baseline
Baseline 3: Content-Based Only
- Match user preferences to restaurant attributes
- Hit Rate@10: ~0.38
- Purpose: Personalization without collaboration
Target: Hybrid Model
- CF + CB + Context
- Hit Rate@10: โฅ0.48
- Purpose: Production model
Synthetic Dataset (Portfolio Project):
- 50,000 users
- 500 restaurants
- 200,000 historical orders
- Realistic distributions and patterns
Production Data (Future):
- User profiles and order history
- Restaurant metadata and availability
- Real-time contextual data
- User feedback (ratings, reviews)
# Order count: Log-normal distribution (power law)
total_orders = np.random.lognormal(mean=1.5, sigma=1.2, size=50000)
# Result: Most users have 3-5 orders, some have 50+
# Order value: Gamma distribution (right-skewed)
avg_order_value = np.random.gamma(shape=4, scale=100, size=50000)
# Result: Most orders โน300-500, some โน1000+
# Preferences: Categorical with realistic proportions
dietary_preference = np.random.choice(
['veg', 'non_veg', 'vegan', 'no_preference'],
p=[0.30, 0.50, 0.05, 0.15] # India-specific distribution
)# Users order from favorite cuisines 3ร more often
if restaurant.cuisine == user.favorite_cuisine:
selection_probability *= 3.0
# Users repeat from liked restaurants (70% repeat, 30% explore)
if random() < 0.70:
restaurant = random.choice(user.previous_restaurants)
else:
restaurant = random.choice(all_restaurants, weighted_by=popularity)Validation Rules:
- No missing values in critical fields (user_id, restaurant_id)
- Ratings in valid range (1.0-5.0)
- Delivery times realistic (15-90 minutes)
- Prices positive
- Dates within valid range
Distribution Checks:
- Order count follows power law (validated via log-log plot)
- Rating distribution skewed towards 4-5 (realistic)
- Cuisine distribution matches real-world (North Indian most popular)
- User-restaurant sparsity ~99% (realistic for recommendations)
Temporal Split (Realistic Evaluation):
# Sort orders chronologically
orders_sorted = orders.sort_values('order_timestamp')
# 80% training, 20% test
split_idx = int(len(orders) * 0.8)
train_orders = orders_sorted[:split_idx]
test_orders = orders_sorted[split_idx:]
# Train models only on past data
# Predict future orders (mimics production)Why Temporal Split?
- Random split leaks future information
- Temporal split is realistic (predict future from past)
- Prevents overfitting on test set
Alternative Considered: K-fold cross-validation
- Rejected: Doesn't respect temporal order
- Risk: Unrealistically high metrics
raw_features = [
'user_id',
'total_orders',
'favorite_cuisine',
'dietary_preference',
'price_sensitivity'
]1. Order-Based Features
# Frequency features
order_count = orders.groupby('user_id').size()
unique_restaurants = orders.groupby('user_id')['restaurant_id'].nunique()
# Monetary features
avg_order_value = orders.groupby('user_id')['order_value'].mean()
total_spent = orders.groupby('user_id')['order_value'].sum()
# Behavioral features
cuisine_diversity = unique_cuisines / total_orders # How exploratory?
repeat_rate = repeat_orders / total_orders # How loyal?2. Recency Features
# Exponential decay for recency
days_since_last = (now - last_order_date).days
recency_score = exp(-days_since_last / 30) # 30-day half-life
# Why exponential decay?
# - Recent behavior is more predictive
# - Smooth falloff (not step function)3. Preference Features
# Most frequently ordered cuisine
most_ordered_cuisine = orders.groupby('user_id')['cuisine_type'].mode()
# Average rating given
avg_rating_given = orders.groupby('user_id')['user_rating'].mean()
# Price preference (derived from order history)
avg_price_range = orders.groupby('user_id')['price_range'].mean()raw_features = [
'restaurant_id',
'name',
'cuisine_type',
'avg_rating',
'price_range'
]1. Popularity Metrics
# Weighted popularity score
popularity_score = (
0.4 ร (total_orders / max_orders) +
0.3 ร (total_reviews / max_reviews) +
0.3 ร (avg_rating / 5.0)
)
# Why these weights?
# - Orders = actual demand (most important)
# - Reviews = user engagement
# - Rating = quality signal2. Customer Retention
# Customers who ordered more than once
repeat_customers = (
orders.groupby(['restaurant_id', 'user_id'])
.size()
.reset_index()
.query('order_count > 1')
.groupby('restaurant_id')
.size()
)
retention_rate = repeat_customers / unique_customers
# High retention = quality restaurant3. Efficiency Metrics
# Delivery efficiency (lower time = higher score)
delivery_efficiency = 1 - (avg_delivery_time - 20) / 40
delivery_efficiency = clip(delivery_efficiency, 0, 1)
# Value for money
value_score = avg_rating / price_range
# High rating + low price = high valueUser-Restaurant Interaction Matrix:
# Weighted interaction score
interaction_score = (
0.4 ร log1p(order_count) + # Frequency (log scale)
0.3 ร (avg_rating / 5.0) + # Satisfaction
0.3 ร exp(-days_since_last / 30) # Recency
)
# Why log(order_count)?
# - Diminishing returns (10 orders not 10ร better than 1)
# - Prevents single restaurant from dominating
# Why exponential decay for recency?
# - Recent preferences are more relevant
# - Tastes change over timeMatrix Properties:
- Shape: 50,000 users ร 500 restaurants
- Sparsity: 99.2% (only 0.8% non-zero)
- Storage: Sparse matrix format (CSR)
Time-of-Day Encoding:
# One-hot encoding for meal times
time_encoding = {
'breakfast': [1, 0, 0, 0, 0],
'lunch': [0, 1, 0, 0, 0],
'evening_snack': [0, 0, 1, 0, 0],
'dinner': [0, 0, 0, 1, 0],
'late_night': [0, 0, 0, 0, 1]
}Distance Features:
# Haversine distance (simplified for demo)
def calculate_distance(user_lat, user_lon, rest_lat, rest_lon):
lat_diff = (rest_lat - user_lat) * 111 # 1ยฐ lat โ 111 km
lon_diff = (rest_lon - user_lon) * 111 * cos(radians(user_lat))
return sqrt(lat_diff**2 + lon_diff**2)
# Distance decay
distance_score = exp(-distance / 3.0) # 3km half-life| Model | Pros | Cons | Hit Rate@10 | Diversity |
|---|---|---|---|---|
| Random | Simple, unbiased | No personalization | 0.05 | 0.85 |
| Most Popular | Simple, fast | No personalization | 0.32 | 0.23 |
| Collaborative Filtering | Discovers patterns, no domain knowledge | Cold start, sparsity | 0.51 | 0.58 |
| Content-Based | Works for new users, explainable | No discovery, over-specialization | 0.38 | 0.65 |
| Hybrid (Ours) | Best of both, handles cold start | More complex | 0.48 | 0.72 |
Algorithm Choice: User-Based CF
Alternatives Considered:
-
Item-Based CF
- Pro: More stable (restaurants don't change)
- Con: Harder to explain ("Users who liked X also liked Y")
- Decision: User-based for explainability
-
Matrix Factorization (SVD)
- Pro: Handles sparsity better
- Con: Not interpretable, requires tuning
- Decision: Simple CF for MVP, SVD for V2
-
Neural Collaborative Filtering
- Pro: Can capture complex patterns
- Con: Requires massive data, hard to explain
- Decision: Overkill for dataset size
Implementation Details:
# Similarity metric: Cosine similarity
# Why? Handles varying order counts (normalized)
user_similarity = cosine_similarity(interaction_matrix)
# Alternative considered: Pearson correlation
# Rejected: Requires mean-centering, less intuitive
# Top-K similar users: K=30
# Why? Balance between diversity and relevance
# Too low (K=5): Too narrow, low recall
# Too high (K=100): Noise, computational costFeature Set:
restaurant_features = [
'cuisine_type_encoded', # One-hot encoded
'price_range', # 1-4 scale
'avg_rating', # 1-5 scale
'is_veg_only', # Binary
'delivery_efficiency', # 0-1 normalized
'value_score', # rating/price
'popularity_score' # 0-1 normalized
]Similarity Metric: Weighted Euclidean
# After standardization
scaler = StandardScaler()
features_scaled = scaler.fit_transform(restaurant_features)
# Cosine similarity on scaled features
similarity = cosine_similarity(features_scaled)Feature Weight Justification:
- Cuisine (40%): Strongest preference signal
- Price (25%): Budget is constraining factor
- Rating (20%): Quality threshold
- Delivery (15%): Convenience factor
Weight Selection Process:
Experiment 1: Grid Search
weight_combinations = [
(0.6, 0.3, 0.1), # CF-heavy
(0.5, 0.4, 0.1),
(0.4, 0.35, 0.25), # Balanced
(0.3, 0.5, 0.2), # CB-heavy
]
results = evaluate_all_combinations(weight_combinations)Results:
| CF | CB | Context | Hit Rate | Diversity | Decision |
|---|---|---|---|---|---|
| 0.6 | 0.3 | 0.1 | 0.52 | 0.61 | Too narrow |
| 0.5 | 0.4 | 0.1 | 0.49 | 0.68 | Low context |
| 0.4 | 0.35 | 0.25 | 0.48 | 0.72 | Selected |
| 0.3 | 0.5 | 0.2 | 0.43 | 0.75 | Low accuracy |
Decision Rationale:
- 40-35-25 split balances accuracy and diversity
- Higher context weight acknowledges real-world importance
- Slight CF advantage for discovery
Adaptive Weighting:
# For cold start users (<3 orders)
if user_order_count < 3:
weights = (0.0, 0.75, 0.25) # No CF, more CB
else:
weights = (0.4, 0.35, 0.25) # Full hybridPrecision@K
def precision_at_k(recommendations, actual_orders, k):
"""
Precision@K = (Relevant items in top-K) / K
Example:
- Top-10 recommendations: [A, B, C, D, E, F, G, H, I, J]
- User ordered: [C, F, K]
- Relevant in top-10: {C, F}
- Precision@10 = 2/10 = 0.20
"""
top_k = recommendations[:k]
hits = len(set(top_k) & set(actual_orders))
return hits / kRecall@K
def recall_at_k(recommendations, actual_orders, k):
"""
Recall@K = (Relevant items in top-K) / Total relevant items
Example:
- Top-10 recommendations: [A, B, C, D, E, F, G, H, I, J]
- User ordered: [C, F, K]
- Relevant in top-10: {C, F}
- Recall@10 = 2/3 = 0.67
"""
top_k = recommendations[:k]
hits = len(set(top_k) & set(actual_orders))
return hits / len(actual_orders) if len(actual_orders) > 0 else 0Hit Rate@K
def hit_rate_at_k(recommendations, actual_orders, k):
"""
Hit Rate@K = 1 if any actual order in top-K, else 0
Binary metric: Did we get at least one right?
Average across users gives overall hit rate.
Example:
- User A: Top-10 contains actual order โ 1
- User B: Top-10 doesn't contain actual order โ 0
- Hit Rate = (1 + 0) / 2 = 0.50
"""
top_k = recommendations[:k]
return 1.0 if len(set(top_k) & set(actual_orders)) > 0 else 0.0NDCG@K (Normalized Discounted Cumulative Gain)
def ndcg_at_k(recommendations, actual_orders, k):
"""
NDCG accounts for position: Items ranked higher = more credit
DCG = ฮฃ(relevance / log2(position + 1))
NDCG = DCG / Ideal DCG
Example:
- Top-5: [A, B*, C, D*, E] (* = ordered)
- DCG = 0 + 1/log2(3) + 0 + 1/log2(5) + 0 = 1.06
- Ideal: [*, *, A, C, E] โ IDCG = 1/log2(2) + 1/log2(3) = 1.63
- NDCG = 1.06 / 1.63 = 0.65
"""
dcg = 0.0
for i, restaurant in enumerate(recommendations[:k]):
if restaurant in actual_orders:
dcg += 1.0 / np.log2(i + 2) # +2 because log2(1)=0
# Ideal DCG (all relevant items at top)
n_relevant = min(len(actual_orders), k)
idcg = sum(1.0 / np.log2(i + 2) for i in range(n_relevant))
return dcg / idcg if idcg > 0 else 0.0Diversity Score
def diversity_score(recommendations):
"""
Cuisine diversity in recommendations
Example:
- Top-10 cuisines: [North Indian, North Indian, Chinese,
South Indian, Italian, Chinese,
North Indian, Fast Food, Biryani, Cafe]
- Unique cuisines: 7
- Diversity = 7 / 10 = 0.70
"""
cuisines = [r.cuisine_type for r in recommendations]
unique_cuisines = len(set(cuisines))
return unique_cuisines / len(recommendations)Novelty Score
def novelty_score(recommendations, user_order_history):
"""
Proportion of new (never ordered) restaurants
Example:
- Top-10 recommendations: [A, B, C, D, E, F, G, H, I, J]
- User history: [A, D, K, L]
- New restaurants: [B, C, E, F, G, H, I, J] = 8
- Novelty = 8 / 10 = 0.80
"""
new_restaurants = [
r for r in recommendations
if r not in user_order_history
]
return len(new_restaurants) / len(recommendations)Coverage Score
def coverage_score(all_recommendations, total_restaurants):
"""
What % of catalog is ever recommended?
Measures if recommender is too narrow
Example:
- Total restaurants: 500
- Unique restaurants recommended across all users: 350
- Coverage = 350 / 500 = 0.70
"""
unique_recommended = set()
for recs in all_recommendations:
unique_recommended.update(recs)
return len(unique_recommended) / total_restaurantsPrimary Metric: Time to Order
- Measurement: From home page load to order confirmation
- Target: Reduce by 40% (10 min โ 6 min)
- Statistical test: Two-sample t-test
Secondary Metrics:
- Order conversion rate
- Restaurant discovery rate
- User session duration
Guardrail Metrics:
- Order cancellation rate (must not increase)
- Average delivery time (must not degrade)
- User satisfaction (ratings/surveys)
Evaluation Results (100 Test Users):
Accuracy Metrics:
โโ Precision@5: 0.0842 (8.4% of top-5 were ordered)
โโ Precision@10: 0.0756 (7.6% of top-10 were ordered)
โโ Recall@10: 0.2145 (21.5% of actual orders in top-10)
โโ Hit Rate@10: 0.4823 (48.2% users ordered from top-10)
โโ NDCG@10: 0.2567 (Good ranking quality)
Discovery Metrics:
โโ Diversity: 0.7234 (7+ cuisines in top-10)
โโ Novelty: 0.6421 (64% new restaurants)
โโ Coverage: 0.4523 (45% catalog recommended)
Interpretation:
- Hit Rate = 48%: Nearly half of users find relevant restaurant in top-10
- Diversity = 72%: Recommendations span multiple cuisines
- Novelty = 64%: Good balance of discovery vs familiarity
Collaborative Filtering:
params = {
'n_neighbors': [10, 20, 30, 50], # Number of similar users
'similarity_metric': ['cosine', 'pearson'],
'min_common_items': [3, 5, 10] # Minimum overlap for similarity
}
# Best: n_neighbors=30, metric='cosine', min_common=5Content-Based Filtering:
params = {
'feature_weights': {
'cuisine': [0.3, 0.4, 0.5],
'price': [0.2, 0.25, 0.3],
'rating': [0.15, 0.20, 0.25],
'delivery': [0.1, 0.15, 0.2]
}
}
# Best: (0.4, 0.25, 0.2, 0.15)Hybrid Weights:
params = {
'cf_weight': [0.3, 0.4, 0.5, 0.6],
'cb_weight': [0.25, 0.35, 0.45],
'context_weight': [0.1, 0.15, 0.25]
}
# Best: (0.4, 0.35, 0.25)Method: Grid search with cross-validation
Objective Function:
def objective(params):
# Primary: Hit Rate (most important)
hit_rate = evaluate_hit_rate(params)
# Secondary: Diversity (prevent filter bubble)
diversity = evaluate_diversity(params)
# Combined score (weighted)
score = 0.7 * hit_rate + 0.3 * diversity
return scoreWhy This Objective?
- Hit rate measures user satisfaction
- Diversity prevents recommendation fatigue
- 70-30 split prioritizes accuracy but ensures variety
Current Latency Breakdown:
Total: ~1.8 seconds
โโ Data loading: 0.3s (user profile, order history)
โโ CF computation: 0.5s (similarity lookup + aggregation)
โโ CB computation: 0.4s (feature matching)
โโ Contextual boost: 0.2s (time/weather/distance)
โโ Ranking + filters: 0.3s (diversity, explanations)
โโ Response format: 0.1s
Optimization Strategies:
- Pre-computation:
# Nightly batch job
for user in all_users:
top_100_recommendations = hybrid_model.recommend(user, n=100)
cache.set(f"recs:{user}", top_100_recommendations)
# Real-time: Apply contextual boost to cached results
cached_recs = cache.get(f"recs:{user}")
contextualized_recs = apply_context(cached_recs, current_context)
return contextualized_recs[:10]- Approximate Nearest Neighbors:
# Replace exact cosine similarity with ANN
from annoy import AnnoyIndex
# Build index
index = AnnoyIndex(n_features, metric='angular')
for i, vector in enumerate(user_vectors):
index.add_item(i, vector)
index.build(n_trees=10)
# Query: O(log n) instead of O(n)
similar_users = index.get_nns_by_item(user_idx, k=30)- Caching:
# Redis cache for frequent queries
cache_key = f"recs:{user_id}:{context_hash}"
cached = redis.get(cache_key)
if cached:
return cached # <10ms
else:
recommendations = compute_recommendations()
redis.setex(cache_key, ttl=3600, value=recommendations)
return recommendationsCurrent: Weekly batch re-training
Production:
-
Daily Batch Updates
- Re-train on last 6 months of data
- Deploy new model if metrics improve
-
Incremental Learning
- Update user embeddings on new orders
- No full re-train needed
-
Online Learning
- Multi-armed bandit for exploration
- Learn from real-time feedback
Model Drift Detection:
# Track daily metrics
metrics_today = {
'hit_rate': 0.48,
'diversity': 0.72,
'latency_p95': 1.8
}
# Alert if degradation >10%
if metrics_today['hit_rate'] < baseline['hit_rate'] * 0.9:
alert("Model performance degraded!")Data Quality Checks:
# Pre-training validation
assert no_missing_values(data)
assert ratings_in_range(data, 1, 5)
assert sparsity_within_bounds(data, 0.98, 0.995)Potential Biases:
- Popularity Bias: Recommend only popular restaurants
- Position Bias: Users click top recommendations more
- Cuisine Bias: Over-represent majority cuisines
- Price Bias: Favor expensive restaurants (higher commission)
Mitigation Strategies:
- Diversity Constraints:
# Maximum 3 restaurants per cuisine
# Ensures minority cuisines get visibility- Fairness Metrics:
# Monitor recommendation distribution
cuisine_distribution = recommendations.groupby('cuisine').size()
# Alert if any cuisine <2% representation
if any(cuisine_distribution / total < 0.02):
flag_for_review()- Exploration:
# 10% of recommendations are random (exploration)
if random() < 0.10:
return random_restaurant()
else:
return top_recommendation()Data Minimization:
- Store only necessary data (last 6 months)
- No PII in model features
- Aggregate user data for analysis
User Control:
- Allow users to clear history
- Opt-out of personalization
- Data export option
โ
Hybrid approach: Best balance of accuracy and discovery
โ
Explainability: Users trust recommendations with clear reasons
โ
Cold start strategy: Onboarding questions work well
โ
Temporal evaluation: Realistic metric estimates
Short-term (V2):
- Real-time availability filtering
- User feedback loop ("Not interested")
- A/B testing framework
Long-term (V3):
- Deep learning embeddings (BERT for text)
- Graph neural networks (user-restaurant-cuisine)
- Reinforcement learning (long-term satisfaction)
- Multi-task learning (predict rating + order probability)
Document Owner: Ayush Saxena
Status: Complete and Production-Ready