This documentation helps you deploy your own instance of the Prepare Dataset Space on Hugging Face. This automated pipeline fetches misclassified grievances from your database, prepares training datasets, and pushes them to Hugging Face Hub for model retraining.
- System Overview
- Architecture Components
- Pipeline Workflow
- Technology Stack
- Implementation Details
- Deployment Architecture
- Quick Start Guide
- Monitoring & Metrics
- Troubleshooting
- Integration with Retraining Pipeline
- Security Considerations
- Conclusion
- Document Metadata
- Automated Data Preparation: Fetch, balance, and preprocess grievance data
- Database Integration: Direct PostgreSQL connection for misclassified records
- Balanced Sampling: Combines all reviewed misclassified records (100%) with correct samples numbering 50% of the misclassified count
- Version Control: Automatic dataset versioning with timestamp-based tags
- Cost-Effective: Runs on-demand, auto-pauses after completion
- Experiment Tracking: Full integration with Weights & Biases for monitoring
Your implementation uses a dedicated HF Space for dataset preparation that:
- Runs on-demand when triggered
- Fetches reviewed misclassifications from PostgreSQL
- Balances data with correct samples
- Automatically pushes versioned datasets to HF Hub
- Pauses itself after completion to save costs
flowchart TD
%% Styling
classDef processStyle fill:#e3f2fd,stroke:#1976d2,stroke-width:1px,font-size:14px
classDef decisionStyle fill:#fff3e0,stroke:#f57c00,stroke-width:2px,font-size:14px
classDef infraStyle fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,font-size:14px
%% Nodes
SpaceRestart["<b>TRIGGER</b><br/>HF Space Restarts<br/>Automated pipeline begins"]
class SpaceRestart infraStyle
LoadConfig["<b>LOAD CONFIGURATION</b><br/>Validate environment variables<br/>Set DB connection & HF tokens<br/>Configure dataset paths"]
class LoadConfig processStyle
DBConnect["<b>DATABASE CONNECTION</b><br/>Connect to PostgreSQL<br/>Retry logic (3 attempts)<br/>Validate connection"]
class DBConnect processStyle
FetchData["<b>FETCH DATA</b><br/>Query misclassified records<br/>Sample correct records (50%)<br/>Combine & validate"]
class FetchData processStyle
CheckSize{"<b>SIZE CHECK</b><br/>Records ≥<br/>MIN_DATASET_LEN?"}
class CheckSize decisionStyle
Preprocess["<b>PREPROCESS</b><br/>Clean text (URLs, HTML, whitespace)<br/>Encode labels to IDs<br/>Create train/eval/test splits"]
class Preprocess processStyle
PushHub["<b>PUSH TO HUB</b><br/>Upload dataset to HF<br/>Create version tag<br/>Generate README & metadata"]
class PushHub processStyle
Skip["<b>SKIP</b><br/>Log insufficient data<br/>Continue to next label"]
class Skip processStyle
PauseSpace["<b>SHUTDOWN</b><br/>Send email<br/>Pause Space<br/>Free resources<br/>Pipeline complete"]
class PauseSpace infraStyle
%% Flow
SpaceRestart --> LoadConfig
LoadConfig --> DBConnect
DBConnect --> FetchData
FetchData --> CheckSize
CheckSize -->|Yes| Preprocess
CheckSize -->|No| Skip
Preprocess --> PushHub
Skip --> PauseSpace
PushHub --> PauseSpace
Fig: Dataset Preparation Pipeline
- Purpose: Store grievances and misclassification feedback
- Tables:
  - `complaints` - All grievances with predictions
  - `misclassified_complaints` - Reviewed corrections
- Query Logic: Fetch reviewed misclassifications + sample correct records
- Purpose: Automated dataset preparation pipeline
- Trigger: Manual restart or API call
- Runtime: Docker container with Python 3.12
- State: Paused when idle, active only during processing
- Output: Versioned datasets pushed to HF Hub
- Purpose: Version-controlled training data storage
- Format: HuggingFace Dataset with train/eval/test splits
- Metadata: Version tags, split sizes, label mappings, timestamps
- Purpose: Pipeline monitoring and data quality tracking
- Logged Data:
- Database connection status
- Records fetched per label
- Dataset push status (success/failed/skipped)
- Pipeline alerts and errors
- Notify User - Send Run Summary Email to Admin
- User or scheduler initiates Space restart
- HF Space receives signal
- Docker container begins building
- Validate all environment variables
- Check HF_TOKEN, POSTGRES_URL, WANDB_API_KEY
- Configure dataset repository paths
- Set minimum dataset size threshold
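The configuration step above could be sketched as follows. This is a minimal illustration, not the Space's actual code: `load_config` and `REQUIRED_VARS` are hypothetical names, and treating the two dataset repository variables as required is an assumption.

```python
import os
from typing import Mapping

# Variables the pipeline cannot run without (assumed set; see lead-in).
REQUIRED_VARS = (
    "HF_TOKEN",
    "POSTGRES_URL",
    "WANDB_API_KEY",
    "DEPARTMENT_DATASET",
    "URGENCY_DATASET",
)


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Validate required variables and apply documented defaults."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    config = {name: env[name] for name in REQUIRED_VARS}
    # Optional settings: the threshold defaults to 1000, the Space ID may be unset.
    config["MIN_DATASET_LEN"] = int(env.get("MIN_DATASET_LEN", "1000"))
    config["PREPARE_DATASET_SPACE_ID"] = env.get("PREPARE_DATASET_SPACE_ID")
    return config
```

Failing fast here means a missing secret is reported before any database or Hub call is attempted.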
- Create SQLAlchemy engine with connection pooling
- Retry logic: 3 attempts with exponential backoff (2s, 4s, 8s)
- Validate connection with a `SELECT 1` query
- Log connection status to WandB
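A retry loop with exponential backoff, as described above, might look like the sketch below. `retry_with_backoff` is a hypothetical helper, and the `sleep` parameter exists only so the delays can be observed in tests. This version does not wait after the final failed attempt, so three attempts use only the 2s and 4s delays; whether the real pipeline also waits 8s before giving up is not stated.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, doubling the wait after each failure (2s, 4s, 8s, ...)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                # Out of attempts: surface the failure to the caller.
                raise RuntimeError(
                    "Database connection failed after multiple attempts."
                ) from exc
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the pipeline, `action` would be a function that opens a connection on the SQLAlchemy engine and runs `SELECT 1`.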
- Query misclassified records:
SELECT c.message, mc.correct_department, mc.correct_urgency FROM misclassified_complaints mc JOIN complaints c ON c.id = mc.complaint_id WHERE mc.reviewed = TRUE AND mc.correct_X IS NOT NULL AND mc.model_predicted_X IS DISTINCT FROM mc.correct_X
- Sample correct records (50% of misclassified count):
SELECT c.message, c.department, c.urgency FROM complaints c WHERE c.id NOT IN (SELECT complaint_id FROM misclassified_complaints)
- Combine both datasets
- Log record counts to WandB
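The `correct_X` and `model_predicted_X` columns in the first query read as a placeholder for the label being processed. Under that assumption, a per-label query could be built as sketched below; `build_misclassified_query` and `ALLOWED_LABELS` are hypothetical names. Column names cannot be passed as bound parameters, so the label is checked against a fixed whitelist before it is formatted into the SQL.

```python
ALLOWED_LABELS = {"department", "urgency"}

QUERY_TEMPLATE = """
SELECT c.message, mc.correct_department, mc.correct_urgency
FROM misclassified_complaints mc
JOIN complaints c ON c.id = mc.complaint_id
WHERE mc.reviewed = TRUE
  AND mc.correct_{label} IS NOT NULL
  AND mc.model_predicted_{label} IS DISTINCT FROM mc.correct_{label}
"""


def build_misclassified_query(label_column: str) -> str:
    """Return the misclassification query for one label column."""
    if label_column not in ALLOWED_LABELS:
        # Reject anything outside the whitelist to avoid SQL injection.
        raise ValueError(f"Unsupported label column: {label_column!r}")
    return QUERY_TEMPLATE.format(label=label_column).strip()
```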
- Compare: `record_count >= MIN_DATASET_LEN` (default: 1000)
- If insufficient: Skip push, log warning, continue to next label
- If sufficient: Proceed to preprocessing
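The size check is applied per label, so one label falling short does not block the other. A minimal sketch, where `plan_pushes` is a hypothetical helper and `"push"` is an illustrative status name (only `skipped_insufficient_data` appears elsewhere in this guide):

```python
def plan_pushes(record_counts: dict[str, int], min_dataset_len: int = 1000) -> dict[str, str]:
    """Decide, label by label, whether there is enough data to push."""
    plan = {}
    for label, count in record_counts.items():
        if count >= min_dataset_len:
            plan[label] = "push"
        else:
            # Too few records: skip this label and continue with the next one.
            plan[label] = "skipped_insufficient_data"
    return plan
```

For example, `plan_pushes({"department": 1523, "urgency": 450})` marks `department` for pushing and `urgency` as skipped.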
- Clean text:
  - Remove URLs: `https://...` → ``
  - Remove HTML tags: `<div>` → ``
  - Normalize whitespace: repeated spaces and newlines → single space
- Encode labels:
  - Department → {0, 1, 2, 3}
  - Urgency → {0, 1, 2}
- Create splits:
  - Train: 80%
  - Eval: 10%
  - Test: 10%
  - Stratified by label
- Generate version tag: `v{YYYYMMDD}_{HHMMSS}`
- Push DatasetDict to HF Hub
- Upload `dataset_metadata.json`:
  { "dataset_name": "username/dataset", "version_tag": "v20251029_143052", "label_column": "department", "num_samples": 1523, "splits": {"train": 1218, "eval": 152, "test": 153} }
- Generate README.md with label mappings and metadata
- Create Git tag on HF Hub
- Log success to WandB
- Send a notification email to the admin via WandB
- Finish WandB run
- Pause Space (free resources)
- Log completion status
- Pipeline complete
Core Framework:
- Python 3.12
- SQLAlchemy (database toolkit used to connect to PostgreSQL)
- Pandas (data manipulation)
- Hugging Face Datasets
- scikit-learn (train/test split)
Key Files:
- `prepare_dataset_pipeline.py` - Main orchestrator
- `prepare_pd_df.py` - Database query logic
- `preprocess_and_prepare_dataset.py` - Preprocessing & HF Hub push
- `Dockerfile` - Container configuration
- `requirements.txt` - Python dependencies
Docker Setup:
- Base: `python:3.12-slim`
- Non-root user for security
- Cached HF artifacts in `/home/user/app/hf_cache`
Environment Variables:
# Hugging Face
HF_TOKEN=<write_access_token>
DEPARTMENT_DATASET=<username>/sambodhan-department-dataset
URGENCY_DATASET=<username>/sambodhan-urgency-dataset
# Database
POSTGRES_URL=postgresql://user:pass@host:port/database
# Weights & Biases
WANDB_API_KEY=<wandb_key>
WANDB_PROJECT_NAME=sambodhan-dataset-pipeline
# Optional
PREPARE_DATASET_SPACE_ID=<username>/prepare-dataset
MIN_DATASET_LEN=1000
Label Mappings:
Department Classification:
department2id = {
    'Municipal Governance & Community Services': 0,
    'Education, Health & Social Welfare': 1,
    'Infrastructure, Utilities & Natural Resources': 2,
    'Security & Law Enforcement': 3
}
Urgency Classification:
urgency2id = {
    'NORMAL': 0,
    'URGENT': 1,
    'HIGHLY URGENT': 2
}
Balanced Dataset Creation:
- Misclassified: 100% (all reviewed records where prediction ≠ correct)
- Correct: 50% of misclassified count (randomly sampled)
- Example: 1000 misclassified + 500 correct = 1500 total
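The sampling rule can be illustrated with the standard library alone. This is a sketch of the arithmetic, not the pandas-based implementation in `prepare_pd_df.py`: `balance_records` is a hypothetical helper, and truncating the sample size with `int()` and capping it at the number of available correct records are assumptions.

```python
import random


def balance_records(
    misclassified: list,
    correct: list,
    correct_ratio: float = 0.5,
    random_state: int = 42,
) -> list:
    """Keep every misclassified record and add a random sample of correct ones."""
    # Sample size is a fraction of the misclassified count, capped by availability.
    n_correct = min(int(len(misclassified) * correct_ratio), len(correct))
    rng = random.Random(random_state)  # fixed seed keeps the sample reproducible
    return list(misclassified) + rng.sample(correct, n_correct)
```

With 1000 misclassified records and a larger pool of correct ones, the result holds 1500 records, matching the example above.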
Train/Eval/Test Split:
- Train: 80% (stratified by label)
- Eval: 10% (stratified by label)
- Test: 10% (stratified by label)
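The percentages do not always divide the data evenly, so the split sizes depend on rounding. The sketch below reproduces the sizes shown in the example metadata (1218 / 152 / 153 for 1523 samples), assuming a two-stage split (hold out 20%, then halve it) and assuming scikit-learn's `train_test_split` rounds a fractional `test_size` up to a whole number of rows. `split_sizes` is a hypothetical helper for illustration only.

```python
import math


def split_sizes(num_samples: int, val_size: float = 0.1, test_size: float = 0.1) -> dict:
    """Estimate train/eval/test sizes for a two-stage 80/10/10 split."""
    # Stage 1: hold out eval + test together, rounding the held-out part up.
    n_temp = math.ceil((val_size + test_size) * num_samples)
    n_train = num_samples - n_temp
    # Stage 2: divide the held-out rows, again rounding the test part up.
    n_test = math.ceil(test_size / (val_size + test_size) * n_temp)
    n_eval = n_temp - n_test
    return {"train": n_train, "eval": n_eval, "test": n_test}


print(split_sizes(1523))  # {'train': 1218, 'eval': 152, 'test': 153}
```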
Purpose: Fetch and balance dataset from PostgreSQL
Parameters:
- `label_column`: 'department' or 'urgency'
- `engine`: SQLAlchemy engine
- `correct_ratio`: Fraction of correct samples (default: 0.5)
- `random_state`: Random seed for reproducibility (default: 42)
Returns: DataFrame with columns ['grievance', 'department', 'urgency']
Logic:
- Query all misclassified records (reviewed = TRUE)
- Calculate correct sample size: `n_correct = n_misclassified * 0.5`
- Sample correct records from complaints not in misclassified table
- Combine and return
Purpose: Clean text and encode labels
Text Cleaning:
import re

def clean_text(text: str) -> str:
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Strip URLs
    text = re.sub(r'<.*?>', '', text)  # Strip HTML tags
    text = re.sub(r'\n', ' ', text)  # Replace newlines with spaces
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace
    return text
Label Encoding:
- Map department/urgency strings to integer IDs
- Drop rows with null labels
- Return DataFrame:
['grievance', 'label']
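To make the cleaning and encoding steps concrete, here is a small standard-library sketch. It is not the pandas code in `preprocess_and_prepare_dataset.py`: `encode_rows` is a hypothetical helper, `clean_text` is a condensed copy of the function above, and the sample rows are invented for illustration.

```python
import re

urgency2id = {"NORMAL": 0, "URGENT": 1, "HIGHLY URGENT": 2}


def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # strip URLs
    text = re.sub(r"<.*?>", "", text)  # strip HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace and newlines


def encode_rows(rows: list, label_column: str, label2id: dict) -> list:
    """Clean each grievance and replace its label string with an integer ID."""
    encoded = []
    for row in rows:
        label = row.get(label_column)
        if label is None:
            continue  # drop rows with null labels
        encoded.append({"grievance": clean_text(row["grievance"]), "label": label2id[label]})
    return encoded


rows = [
    {"grievance": "Road broken <b>badly</b>\nsee https://example.com/x  now", "urgency": "URGENT"},
    {"grievance": "No label on this one", "urgency": None},
]
print(encode_rows(rows, "urgency", urgency2id))
# [{'grievance': 'Road broken badly see now', 'label': 1}]
```

A label string missing from the mapping raises `KeyError` in this sketch, which surfaces unexpected categories early instead of silently dropping them.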
Purpose: Create train/eval/test splits
Implementation:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

def split_dataset(df, train_size=0.8, val_size=0.1, test_size=0.1):
    # First split: train vs. temp (eval + test combined)
    train_df, temp_df = train_test_split(
        df, test_size=(val_size + test_size),
        stratify=df['label'], random_state=42
    )
    # Second split: divide temp into eval and test
    val_df, test_df = train_test_split(
        temp_df, test_size=(test_size / (val_size + test_size)),
        stratify=temp_df['label'], random_state=42
    )
    return DatasetDict({
        'train': Dataset.from_pandas(train_df),
        'eval': Dataset.from_pandas(val_df),
        'test': Dataset.from_pandas(test_df)
    })
Purpose: Complete preprocessing and HF Hub upload
Workflow:
- Clean and encode dataset
- Create train/eval/test splits
- Generate version tag (timestamp-based)
- Push to HF Hub with commit message
- Upload `dataset_metadata.json`
- Generate and upload README.md
- Create Git tag
- Return DatasetDict
prepare_dataset/
├── prepare_dataset_pipeline.py # Main orchestrator
├── prepare_pd_df.py # Database queries
├── preprocess_and_prepare_dataset.py # Preprocessing & push
├── requirements.txt
├── Dockerfile
└── README.md
Dockerfile:
FROM python:3.12-slim
RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
PATH="/home/user/.local/bin:$PATH"
WORKDIR /home/user/app
COPY --chown=user requirements.txt /home/user/app/requirements.txt
RUN pip install --upgrade pip \
&& pip install --no-cache-dir -r requirements.txt
ENV HF_HOME=/home/user/app/hf_cache \
HF_DATASETS_CACHE=/home/user/app/hf_cache \
HF_METRICS_CACHE=/home/user/app/hf_cache
RUN mkdir -p /home/user/app/hf_cache && chmod -R 777 /home/user/app/hf_cache
COPY --chown=user . /home/user/app
CMD ["python", "prepare_dataset_pipeline.py"]
Key Features:
- Non-root user for security
- Cached HF artifacts
- Slim base image
- PostgreSQL driver (psycopg2-binary)
Dataset Versioning:
- Format: `v{YYYYMMDD}_{HHMMSS}`
- Example: `v20251029_143052`
- Stored as Git tags on HF Hub
- Immutable once created
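A tag in this format can be produced with `datetime.strftime`. The sketch below is illustrative: `make_version_tag` is a hypothetical helper, and using local time rather than UTC is an assumption.

```python
from datetime import datetime
from typing import Optional


def make_version_tag(now: Optional[datetime] = None) -> str:
    """Format a timestamp as v{YYYYMMDD}_{HHMMSS}."""
    now = now or datetime.now()  # assumption: local time, not UTC
    return now.strftime("v%Y%m%d_%H%M%S")


print(make_version_tag(datetime(2025, 10, 29, 14, 30, 52)))  # v20251029_143052
```

Because the tag has one-second resolution, two runs that finish within the same second would produce the same tag.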
Metadata Tracking:
{
"dataset_name": "username/sambodhan-department-dataset",
"version_tag": "v20251029_143052",
"label_column": "department",
"created_at": "2025-10-29T14:30:52.123456",
"num_samples": 1523,
"splits": {
"train": 1218,
"eval": 152,
"test": 153
},
"author": "mr-kush",
"description": "Processed and versioned dataset for department classification."
}
- Hugging Face Account: https://huggingface.co/join
- HF Write Token: Settings → Access Tokens → New token (write)
- WandB Account: https://wandb.ai/signup
- WandB API Key: Settings → API keys
- PostgreSQL Database: With schema described in Architecture Components
Create two empty dataset repositories on HF Hub:
- Go to https://huggingface.co/new-dataset
- Create `your-username/sambodhan-department-dataset`
- Create `your-username/sambodhan-urgency-dataset`
- Select "Public" or "Private" visibility
# Clone the prepare dataset space
git clone https://huggingface.co/spaces/sambodhan/prepare_dataset
# Navigate to directory
cd prepare_dataset
# Remove original git remote
git remote remove origin
# Create your new Space on HF (via web interface first)
# Go to https://huggingface.co/new-space
# Choose "Docker" SDK, select "CPU basic"
# Then link it:
git remote add origin https://huggingface.co/spaces/your-username/prepare-dataset
# Configure git
git config user.email "your-email@example.com"
git config user.name "Your Name"
Go to your Space settings on Hugging Face and add these Secrets:
# Hugging Face Configuration
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
# Dataset Repositories (create these first!)
DEPARTMENT_DATASET=your-username/sambodhan-department-dataset
URGENCY_DATASET=your-username/sambodhan-urgency-dataset
# Database Connection
POSTGRES_URL=postgresql://user:pass@host:5432/dbname
# Weights & Biases
WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WANDB_PROJECT_NAME=sambodhan-dataset-pipeline
# Optional: This Space's own ID (for auto-pause)
PREPARE_DATASET_SPACE_ID=your-username/prepare-dataset
# Optional: Minimum dataset size (default: 1000)
MIN_DATASET_LEN=1000
# Add all files
git add .
# Commit
git commit -m "Initial setup: Configure dataset preparation space"
# Push to your Space
git push origin main
Option A: Manual Restart (via UI)
- Go to your Space URL
- Click "Restart Space" button
- Monitor logs in real-time
Option B: Programmatic Restart
from huggingface_hub import HfApi
api = HfApi()
api.restart_space(
    repo_id="your-username/prepare-dataset",
    token="your_hf_token"
)
Check HF Dataset Hub:
- Visit `https://huggingface.co/datasets/your-username/sambodhan-department-dataset`
- Verify version tag exists (e.g., `v20251029_143052`)
- Check `dataset_metadata.json` file
- Review generated README.md
Check WandB:
- Go to `https://wandb.ai/your-username/sambodhan-dataset-pipeline`
- View latest run
- Check metrics: `department_records_fetched`, `urgency_records_fetched`
- Verify push status: `success` or `skipped_insufficient_data`
Logged Metrics:
- `db_connection_status`: Database connection result
- `department_records_fetched`: Number of department samples
- `urgency_records_fetched`: Number of urgency samples
- `department_push_status`: Upload status (success/failed/skipped)
- `urgency_push_status`: Upload status
- `hf_space_pause`: Auto-pause result
Alerts:
- 🔴 ERROR: Database connection failed
- 🔴 ERROR: Dataset preparation failed
- 🟡 WARN: Dataset skipped (insufficient data)
- 🟢 INFO: Dataset updated successfully
- 🟡 WARN: HF Space pause failed
Dashboard Views:
- Pipeline run history
- Dataset size trends over time
- Success/failure rate
- Average processing time
Successful Run Example:
Starting dataset preparation pipeline...
Created SQLAlchemy engine. Validating connection...
Database connection successful.
Fetching misclassified data for 'department'...
Retrieved 1523 records for 'department'.
Preprocessing and pushing 'department' dataset to HF Hub...
[INFO] Dataset successfully pushed to Hugging Face Hub: username/dept-dataset
README.md successfully uploaded for username/dept-dataset (v20251029_143052)
[INFO] Version tag created: v20251029_143052
Successfully pushed 'department' dataset.
Fetching misclassified data for 'urgency'...
Retrieved 980 records for 'urgency'.
Preprocessing and pushing 'urgency' dataset to HF Hub...
[INFO] Dataset successfully pushed to Hugging Face Hub: username/urgency-dataset
Successfully pushed 'urgency' dataset.
⏸ Attempting to pause Hugging Face Space...
Hugging Face Space paused successfully.
Dataset preparation completed successfully!
Skipped Dataset Example:
Fetching misclassified data for 'urgency'...
Retrieved 450 records for 'urgency'.
Skipped pushing 'urgency' dataset — insufficient data (450 < 1000).
[SKIPPED] Skipped pushing 'urgency' dataset...
Error:
Database connection failed after multiple attempts.
Solutions:
- ✅ Verify `POSTGRES_URL` format: `postgresql://user:pass@host:port/db`
- ✅ Check database server is accessible from HF servers
- ✅ Whitelist Hugging Face IP ranges in firewall
- ✅ Test connection locally:
  from sqlalchemy import create_engine
  engine = create_engine("your_postgres_url")
  with engine.connect() as conn:
      conn.exec_driver_sql("SELECT 1")
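Before testing connectivity, it can help to rule out a malformed URL. The following is a rough sketch using only the standard library; `check_postgres_url` is a hypothetical helper, and the accepted scheme names are assumptions based on common SQLAlchemy usage (recent SQLAlchemy versions reject the older `postgres://` scheme).

```python
from urllib.parse import urlsplit


def check_postgres_url(url: str) -> list:
    """Return a list of human-readable problems found in a connection URL."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgresql+psycopg2"):
        problems.append(f"unexpected scheme {parts.scheme!r}; expected 'postgresql'")
    if not parts.username or not parts.password:
        problems.append("missing user or password")
    if not parts.hostname:
        problems.append("missing host")
    try:
        parts.port  # raises ValueError when the port is not a valid number
    except ValueError:
        problems.append("port is not a valid number")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems
```

An empty list means the URL has the expected shape. Passwords containing characters such as `@` or `/` need to be percent-encoded, or the host will be parsed incorrectly.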
Error:
[ERROR] Failed to push dataset: 401 Client Error: Unauthorized
Solutions:
- ✅ Verify `HF_TOKEN` has write permissions
- ✅ Confirm dataset repository exists
- ✅ Check repository name matches environment variable
- ✅ Ensure token is not expired
Error:
wandb: ERROR API key is invalid
Solutions:
- ✅ Verify `WANDB_API_KEY` is correct
- ✅ Get new key from https://wandb.ai/settings
- ✅ Check WandB service status
Warning:
Skipped pushing 'department' dataset — insufficient data (450 < 1000).
Solutions:
- ✅ Lower `MIN_DATASET_LEN` threshold
- ✅ Collect more reviewed misclassifications
- ✅ Raise `correct_ratio` in code so more correct samples are added to the total:
  df = fetch_misclassified_dataframe(
      label_column=label,
      engine=engine,
      correct_ratio=1.0  # Use 100% instead of 50%
  )
Warning:
Failed to pause HF Space: Space not found
Solutions:
- ✅ Verify `PREPARE_DATASET_SPACE_ID` format: `username/space-name`
- ✅ Ensure token has Space management permissions
- ✅ This is non-critical - pipeline still completes
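One way to keep the pause step non-critical is to catch its failure and log a warning instead of raising. In the sketch below, `pause_space_safely` is a hypothetical wrapper; `api` is expected to be a `huggingface_hub.HfApi` instance (whose `pause_space` method takes a `repo_id`), but any object with that method works, which keeps the example testable offline.

```python
import logging

logger = logging.getLogger("prepare_dataset")


def pause_space_safely(api, space_id) -> bool:
    """Try to pause the Space; return False instead of raising on failure."""
    if not space_id:
        logger.warning("PREPARE_DATASET_SPACE_ID is not set; skipping auto-pause.")
        return False
    try:
        api.pause_space(repo_id=space_id)
    except Exception as exc:
        # A failed pause should not mark the whole pipeline run as failed.
        logger.warning("Failed to pause HF Space: %s", exc)
        return False
    return True
```

The boolean result could be logged as the `hf_space_pause` metric.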
graph LR
A[Prepare Dataset Space] -->|Push Dataset| B[HF Dataset Hub]
B -->|Trigger Manually| C[Retrain Space]
C -->|Evaluate & Deploy| D[Inference Space]
D -->|Collect Feedback| E[PostgreSQL DB]
E -->|Fetch Misclassified| A
import os
from huggingface_hub import HfApi

HF_TOKEN = os.environ["HF_TOKEN"]  # Read the token from the environment
api = HfApi()
# After dataset preparation completes, trigger retraining
api.restart_space(
    repo_id="your-username/urgency-classifier-retraining",
    token=HF_TOKEN
)
api.restart_space(
    repo_id="your-username/department-classifier-retraining",
    token=HF_TOKEN
)
WandB Projects Structure:
- `sambodhan-dataset-pipeline` - Dataset preparation runs
- `sambodhan-urgency-classifier` - Urgency model training
- `sambodhan-department-classifier` - Department model training
- Use HF tokens with minimum required permissions
- Rotate tokens regularly (quarterly)
- Never commit tokens to git
- Use HF Spaces secrets exclusively
- Use read-only database user if possible
- Never commit `POSTGRES_URL` to git
- Whitelist only necessary IP ranges
- Use SSL/TLS for database connections
- Rotate database passwords quarterly
- Ensure grievance data complies with privacy policies
- Use private repositories for sensitive data
- Implement data anonymization if needed
- Regular security audits of database access
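For the SSL/TLS recommendation, one common approach with PostgreSQL drivers built on libpq (such as psycopg2) is to add an `sslmode` parameter to the connection URL. The sketch below assumes that approach; `require_ssl` is a hypothetical helper, and the right mode (`require`, `verify-full`, and so on) depends on your database provider.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def require_ssl(url: str, mode: str = "require") -> str:
    """Add sslmode to a PostgreSQL URL unless one is already set."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.setdefault("sslmode", mode)  # keep an explicit, possibly stricter, setting
    return urlunsplit(parts._replace(query=urlencode(query)))


print(require_ssl("postgresql://user:pass@host:5432/dbname"))
# postgresql://user:pass@host:5432/dbname?sslmode=require
```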
This implementation automates continuous dataset preparation, from database query to versioned upload. The key advantages are:
- Automated: Direct database integration with retry logic
- Balanced: Smart sampling strategy (misclassified + correct)
- Versioned: Timestamp-based tags for full traceability
- Quality Tracked: WandB integration for monitoring
- Cost-Effective: Auto-pause saves compute costs
This architecture feeds the retraining pipeline directly, enabling continuous model improvement.
- Hugging Face Org: https://huggingface.co/sambodhan
- Template Space: https://huggingface.co/spaces/sambodhan/prepare_dataset
Document Version: 1.0
Last Updated: October 29, 2025
Author: Based on implementation by @kushalregmi61
Status: Production-Ready ✅