Purpose: A deep-dive reference you can use to explain every part of this project — from raw math to production decisions — to a supervisor, reviewer, or collaborator. Each section covers what, why, and the math behind it.
- Project Overview
- Repository Layout
- Dataset Pipeline
- Model Architecture
- Training
- Inference Engine
- Evaluation Framework
- Experiment Scripts
- Training Rounds Log
- Paper Positioning
- Tags Reference
- Quick-Start Command Reference
One-sentence description:
We fine-tune a 7B-parameter language model to act as a prompt refactoring engine —
taking verbose, filler-laden human prompts and rewriting them into compact, structured
prompts that preserve intent while using fewer tokens.
Why this matters:
- Every token sent to an LLM API has a cost. Shorter prompts → lower latency + cost.
- Structured prompts empirically improve LLM output quality.
- No existing system learns to rewrite prompts; most methods use heuristics or brute-force token pruning.
The core transformation:
Input (verbose): "Hey could you please help me write a Python function that
sorts a list of numbers and also explain the algorithm and
make sure it works for negative numbers too? Thanks!"
Output (optimised): "Task: Python sort function.
Input: list[int] (incl. negatives).
Output: sorted list + algorithm explanation."
Token reduction: 47 tokens → 22 tokens = 53% compression with zero loss of intent.
Prompt-Optimizer-v1/
├── configs/
│ └── default.yaml ← All hyperparameters in one place
├── scripts/
│ ├── generate_dataset.py ← Build training data (seeds → JSONL)
│ ├── train.py ← Launch fine-tuning
│ ├── evaluate.py ← Run 20-prompt evaluation suite
│ ├── baselines.py ← Compare 5 methods side-by-side
│ ├── benchmark.py ← Downstream accuracy (GSM8K + MMLU)
│ └── pareto.py ← THE publishable Pareto experiment
├── src/
│ ├── config.py ← Pydantic config loader
│ ├── utils.py ← Logging, seed, helpers
│ ├── dataset/
│ │ ├── seeds.py ← 89 hand-crafted (instruction, output) pairs
│ │ ├── generator.py ← Augmentation engine → 1200+ examples
│ │ └── formatter.py ← Chat-template formatter + SYSTEM_MESSAGE
│ ├── inference/
│ │ └── engine.py ← PromptOptimizer: load adapter, best-of-N decode
│ ├── evaluation/
│ │ └── metrics.py ← All metrics: compression, similarity, perplexity, TES
│ └── training/
│ ├── train.py ← Model load, LoRA setup, SFT loop
│ └── cw_trainer.py ← CW-SFT trainer + dual-objective weights
├── app/ ← Gradio UI
├── GUIDE.md ← This file
└── requirements.txt
File: src/dataset/seeds.py
The dataset starts with 89 hand-crafted (instruction, output) pairs covering:
- Standard coding tasks (SQL, BST, Docker, FastAPI, etc.) — ~65 seeds
- Short-prompt seeds (40–70 token inputs, ≥40% compression) — 10 seeds
e.g., "find second largest number" → "Task: 2nd largest int in list." - High-compression seeds (50–60% target, longer inputs) — 14 seeds
Each seed is a Python dict:
{
"instruction": "the verbose human prompt ...",
"output": "Task: ...\nStack: ...\nOutput: ..."
}Why hand-crafted seeds matter:
LLMs learn by imitation. The seed quality directly sets the upper bound on what
the model can learn. High-compression seeds teach aggressive compression;
structured-output seeds teach the preferred output format.
File: src/dataset/generator.py
1,289 training examples come from augmenting 89 seeds using 7 strategies:
| Strategy | What it does | Probability |
|---|---|---|
| VERBOSE_PREFIX | Prepend "Hey could you please help me with..." | 0.20 |
| VERBOSE_SUFFIX | Append "Thanks so much! I really appreciate it" | 0.15 |
| COMBINED | Both prefix + suffix | 0.15 |
| REPHRASE | Synonym substitution on filler words | 0.10 |
| CONTEXT | Add irrelevant context paragraph | 0.15 |
| HEAVY | Prepend + 2–3 full padding sentences + append | 0.15 |
| EXTREME | Full wrap + duplicate sentence + mid-insert + trailing noise | 0.20 |
Why augmentation? Each seed produces ~13 augmented variants (1200 ÷ 89). The model sees the same core content with many different verbosity styles, learning that the rule "strip filler" generalises across input style, not just memorising specific seeds.
Train/val split: 90% train (1,160 examples) / 10% val (129 examples), stratified.
File: src/dataset/formatter.py
Every training example is wrapped in a 3-turn chat template:
[SYSTEM] SYSTEM_MESSAGE (compression rules)
[USER] verbose prompt (instruction field)
[ASSISTANT] compressed prompt (output field)
The SYSTEM_MESSAGE contains 8 rules the model is trained to follow:
- Delete every filler word, pleasantry, hedging phrase.
- Use terse labels: Task, Stack, Input, Output, Constraints.
- Use shorthand:
→,;,/, numbered lists, abbreviations. - Merge related sentences into single dense lines.
- Never add information not in the original.
- Always produce ≥40% fewer tokens — aim for 50%+.
- The output must be immediately usable by another LLM.
- Even short prompts must be compressed — strip ALL fluff.
The tokenizer's apply_chat_template() method converts this 3-turn list into a
single string using the model's native format (e.g., [INST]...[/INST] for
Mistral). This is important because the loss is only computed on the assistant
turn — the model learns to predict the compressed output, not the system/user
turns.
- 7 billion parameters, decoder-only transformer
- Pre-trained on vast web text, then instruction-tuned by Mistral AI
- Already knows how to follow instructions and produce structured text
- We do not change the base weights — we attach a small LoRA adapter
Why Mistral-7B?
- Strong instruction-following baseline (better than Llama-2-7B on most benchmarks)
- Fits in a single T4 GPU (16 GB VRAM) with 4-bit quantisation
- The Instruct variant's chat template aligns with our training format
File: src/training/train.py → BitsAndBytesConfig
Full 7B fp16 weights require ~14 GB VRAM. We quantise to 4-bit NF4 (Normal Float 4), reducing to ~4 GB, leaving room for activations and LoRA.
NF4 math: Normal Float 4 maps float16 weights to the nearest value in a 4-bit lookup table whose 16 quantisation levels are chosen to be information- theoretically optimal for normally distributed weights (which neural network weights typically are). Compared to uniform int4, NF4 has lower quantisation error for the same bit-width.
The quantisation formula for a weight
where
During training: The frozen quantised weights are dequantised on-the-fly to
bf16 for the forward pass (bnb_4bit_compute_dtype=bfloat16). Only LoRA adapter
weights are stored and updated in full precision.
File: src/training/train.py → build_lora_config()
LoRA (Low-Rank Adaptation) adds a small trainable bypass to each frozen weight
matrix. For a weight matrix
where:
-
$B \in \mathbb{R}^{d \times r}$ and$A \in \mathbb{R}^{r \times k}$ are the LoRA matrices -
$r$ = rank (set to 32 here) — controls adapter capacity -
$\alpha$ = scaling factor (set to 64) —$\frac{\alpha}{r} = 2.0$ is the effective learning rate multiplier
Why rank 32? Lower rank (r=4, r=8) is insufficient for learning a non-trivial rewriting policy. Higher rank (r=64, r=128) increases risk of overfitting on our ~1,200-example dataset. r=32 is a good middle ground.
Target modules (7 weight matrices per transformer block):
q_proj, k_proj, v_proj, o_proj ← attention projections
gate_proj, up_proj, down_proj ← MLP feed-forward projections
Adapting all 7 modules (rather than just q/v) gives the model more capacity to learn the compression policy, at the cost of ~15M extra trainable parameters (vs ~350M total base params, so ~4% overhead).
Trainable parameter count: ~20M / 7B total = 0.29% of total params.
Standard Supervised Fine-Tuning (SFT) minimises the causal language modelling cross-entropy loss over the assistant turns only:
where:
-
$x_i$ = the full context (system + user prompt) -
$y_i$ = the target compressed prompt (assistant turn) -
$y_i^t$ = the$t$ -th token of the output -
$\theta$ = LoRA adapter parameters
In plain English: the model is trained to predict each token of the compressed output, given all previous tokens and the verbose input. The loss only flows through the output tokens (not the input), so the model learns to compress, not to copy.
After 5 epochs over ~1,160 training examples, the model parameters satisfy:
File: src/training/cw_trainer.py → CompressionWeightedSFTTrainer
Motivation: Standard SFT treats all training examples equally. But an example that achieves 10% compression doesn't teach the model anything useful about aggressive compression — it shouldn't have the same influence on training as an example that achieves 55% compression.
The idea: Replace uniform random sampling with weighted sampling, where each example's probability of being drawn in a batch is proportional to how aggressively it compresses.
Per-example compression ratio:
where
Sampling weight:
The
-
$\alpha = 0$ → uniform sampling (standard SFT) -
$\alpha = 1$ → linear weighting -
$\alpha = 2$ → quadratic (default — strongly prefers high-compression examples) -
$\alpha = 3$ → very aggressive; low-compression examples nearly ignored
Example: If example A has 10% compression (
Example B is sampled
Implementation: Uses PyTorch's WeightedRandomSampler inside the overridden
get_train_dataloader() method of SFTTrainer. The sampler is run with
replacement, so the effective training distribution shifts but the number of
gradient steps per epoch stays constant.
Information Bottleneck framing (for the paper):
CW-SFT implicitly optimises a form of the Information Bottleneck objective:
where
File: src/training/cw_trainer.py → compute_dual_weights()
Pure compression weighting has a flaw: an example with 60% compression but near-zero semantic fidelity (e.g., it just strips everything) has a high weight but teaches the model to destroy meaning.
Fix: Multiply the compression weight by a semantic similarity term:
where
This is the formal multi-objective implied by the full loss: $$\mathcal{L}{\text{total}} = \underbrace{\alpha \cdot \text{TokenLength}}\text{compression} + \underbrace{\beta \cdot (1 - \text{SemanticSim})}\text{fidelity} + \underbrace{\mathcal{L}\text{CE}}_\text{imitation}$$
Practical settings:
--cw-alpha 2.0 --sem-beta 0.0→ standard CW-SFT (compression only)--cw-alpha 2.0 --sem-beta 1.0→ balanced dual objective (recommended for Round 6)--cw-alpha 2.0 --sem-beta 2.0→ strong fidelity gate
Why this matters for reviewers: This is the "secret sauce" that transforms the project from "just SFT" into a system with a formally grounded, novel training objective. No prior prompt compression paper uses this sampling scheme.
| Hyperparameter | Value | Why |
|---|---|---|
| Epochs | 5 | Enough to converge; more over-fits on ~1,200 examples |
| Batch size | 2 | T4 VRAM constraint with 7B model |
| Gradient accumulation | 4 | Effective batch = 8; smoother gradients |
| Learning rate | 2e-5 | Standard for QLoRA fine-tunes; 2e-4 caused overfitting in Round 2 |
| LR scheduler | cosine | Smooth decay avoids sudden loss spikes at epoch boundaries |
| Warmup steps | 50 | ~7% of total steps; prevents early instability |
| max_grad_norm | 1.0 | Gradient clipping prevents the grad_norm spikes seen in Round 3 |
| LoRA rank (r) | 32 | Sufficient capacity without overfitting |
| LoRA alpha | 64 | Effective LR multiplier α/r = 2.0 |
| LoRA dropout | 0.1 | Light regularisation for a 1,200-example dataset |
| Max seq length | 512 | Covers >99% of our prompt-pairs |
File: src/inference/engine.py
At inference time, the engine:
- Loads Mistral-7B in 4-bit NF4 with the LoRA adapter merged
- Constructs the chat input:
[SYSTEM] SYSTEM_MESSAGE [USER] verbose_prompt [INST] (generation starts here) - Generates
best_of=5candidate outputs using temperature sampling - Filters candidates: prefers those with ≥10% compression
(
comp_tokens ≤ 0.9 × input_tokens) - Returns the best strong candidate (or shortest overall as fallback)
Best-of-N math: Generating
This is a simple but effective form of search over the model's output distribution without needing a separate reward model.
No-expansion guarantee: The strong candidate filter ensures we never return
a prompt that is longer than the input — a common failure mode of naive LLM
rewriting.
File: src/evaluation/metrics.py
The most basic metric — how many tokens were saved:
Lower ratio = more compressed. A ratio of 0.6 means the output is 60% the length of the input — i.e., 40% of tokens were removed.
Limitation (reviewer concern): A prompt reduced to a single word achieves 99% compression but is useless. This is why we need the metrics below.
where all-MiniLM-L6-v2
(a 384-dimensional dense vector trained to place semantically similar sentences
close together in cosine space).
Values:
- 1.0 = identical meaning
- 0.9+ = nearly identical intent preserved
- 0.7–0.9 = good compression, slight meaning shift
- <0.7 = potentially too aggressive
In our Round 5 results, average similarity is 0.48 using the trigram
fallback (SBERT not installed). After pip install sentence-transformers the
true SBERT values will be higher and more meaningful.
Trigram fallback (used when SBERT unavailable): Character 3-gram cosine similarity — compares character-level n-gram frequency vectors. Faster but less semantically accurate than SBERT.
Perplexity measures how natural/fluent the compressed prompt is to the base language model. A lower perplexity means the LLM finds the compressed prompt more natural.
Key insight for the paper: If our compression removes ungrammatical filler and adds structured labels, the compressed prompt should have lower perplexity than the verbose original — this is direct evidence that the compressed prompt is a better instruction for the LLM.
Implementation: compute_prompt_perplexity(text, model, tokenizer) in
metrics.py. Computes the model's cross-entropy loss on the text and
exponentiates it.
This metric answers: "How much task performance do you get per token spent?"
Example from the benchmark:
| Method | Accuracy | Avg Tokens | TES |
|---|---|---|---|
| Verbose original | 68% | 120 | 0.00567 |
| Compressed (ours) | 67% | 72 | 0.00931 |
Even with a tiny accuracy drop, TES improves by 64% — the model gets nearly the same performance at 40% of the cost.
Why TES is a paper-worthy metric: It directly quantifies the cost-performance trade-off that motivates the entire task. No existing prompt compression paper defines this metric explicitly.
Runs the fine-tuned adapter on 20 fixed evaluation prompts spanning realistic software-engineering tasks: SQL, BST, Terraform, FastAPI, log parser, Docker, React, GitHub Actions, pandas, JWT auth, Kubernetes, WebSocket, decorators, DB migration, GraphQL, Prometheus, OAuth2, CLI tool, Redis caching, FastAPI testing.
Output: Table with Orig / Opt / Saved / Ratio / %Red / Struct / InBudget / Sim
for each prompt, plus averages. The AVG %Red line is the headline metric
used to compare training rounds.
python3 scripts/evaluate.pyCompares 5 methods on the same 20 eval prompts:
| Baseline | Description |
|---|---|
| No Compression | Identity (original prompt unchanged) |
| Heuristic (Regex) | Strips filler phrases with regex — ~25 patterns |
| Zero-Shot (Base) | Mistral-7B without adapter, instructed with SYSTEM_MESSAGE |
| LLMLingua-2 | Microsoft's classical perplexity-based token pruning |
| Ours (CW-SFT+BoN) | Fine-tuned adapter + best-of-5 decoding |
Output: Table with Avg Tokens / Avg %Red / Min / Max / Avg Sim per method.
Why LLMLingua matters here: Kill Shot #1 from reviewers is "you're just LLMLingua with fine-tuning." If our method achieves higher semantic similarity at the same or better compression, we directly refute this.
python3 scripts/baselines.pyThe critical publication experiment: does compressing prompts hurt (or help) downstream task performance?
Protocol:
- Take N math/reasoning questions
- Wrap each in a verbose template (8 styles for GSM8K, 3 for MMLU)
- Compress with our adapter
- Disable the adapter (back to base Mistral-7B)
- Solve both verbose and compressed versions
- Compare accuracy
Why disable the adapter for solving? We want to isolate the effect of compression quality, not adapter fine-tuning. The base model is the "downstream LLM" that a user would send the compressed prompt to.
Key metrics output:
accuracy_verbose_pct— how well the base model solves verbose promptsaccuracy_compressed_pct— how well it solves our compressed promptsaccuracy_retention_pct = compressed / verbose × 100%(target: ≥95%)tes_compressedvstes_verbose(target: TES improvement ≥50%)
python3 scripts/benchmark.py --n-samples 200 --add-mmluThe single experiment that determines if this work is publishable.
Generates the Compression–Quality Pareto Frontier plot:
- X-axis: Token compression % (0% = no compression → 80% = very aggressive)
- Y-axis: Downstream task accuracy (%)
Four conditions:
A) Original (0% compression, ceiling accuracy) — anchor point
B) LLMLingua-2 — classical baseline, SWEPT across 7 compression levels
to produce a full curve
C) Zero-Shot — base Mistral instructed to rewrite (single operating point)
D) Ours — CW-SFT adapter (single operating point)
The publishable claim:
If point D lies in the upper-left region relative to the LLMLingua-2
curve — i.e., higher accuracy at the same compression, or same accuracy at
higher compression — it Pareto-dominates all baselines. That single result,
replicated on 2 tasks and 2 models, is a full workshop paper.
Output: outputs/pareto_curve.png — publication-ready matplotlib figure.
# Fast version (no LLMLingua, ~30 min):
python3 scripts/pareto.py --task gsm8k --n-samples 100 --no-llmlingua
# Full version with LLMLingua sweep (~3 hrs):
python3 scripts/pareto.py --task gsm8k --n-samples 100
python3 scripts/pareto.py --task mmlu --n-samples 100| Round | Seeds | Dataset | Method | Avg Compression | Notes |
|---|---|---|---|---|---|
| 1 | 10 | ~400 | Std SFT | 3.7% | Baseline |
| 2 | 10 | ~400 | Std SFT | 13.8% | Fixed LR (2e-4 → 2e-5) |
| 3 | 65 | ~935 | Std SFT | 22% | More seeds, grad_norm fixed |
| 4 | 65 | ~935 | Std SFT | 24.7% | High-compress seeds, EXTREME aug, best_of=5 |
| 5 | 89 | 1,289 | Std SFT | 24.1% | New short-prompt + high-comp seeds; confirms plateau |
| 6 (planned) | 89 | 1,289 | CW-SFT α=2 β=1 | TBD | Dual-objective; should break plateau |
Key insight from Round 5: Flat vs Round 4 despite more seeds proves that data quantity is not the bottleneck — the training objective is. This makes Round 6 (CW-SFT) the critical experiment.
"CRAT: Compression-Ratio-Aware Training for Efficient LLM Prompt Optimization"
| Method | Approach | Our difference |
|---|---|---|
| LLMLingua / LLMLingua-2 (Microsoft) | Perplexity-based token pruning | We rewrite (structured reformatting), not prune; preserves grammaticality |
| OPRO (DeepMind) | LLM self-reflection loop | Requires expensive frontier model; ours is a small fine-tuned model |
| DSPy (Stanford) | Programmatic prompt pipelines | Different task: we optimize user prompts, not program pipelines |
| Selective Context | Remove low-info sentences | No structure awareness; we add structured labels |
| Ours | Fine-tuned rewriting + CW-SFT | Only method with formal compression-weighted training objective |
Table 1 — Baseline Comparison (baselines.py):
| Method | Avg Tokens | Avg %Red | Avg Sim |
|---|---|---|---|
| No Compression | 120 | 0% | 1.00 |
| Heuristic (Regex) | 98 | 18% | 0.89 |
| Zero-Shot (Base) | 85 | 29% | 0.82 |
| LLMLingua-2 | 72 | 40% | 0.76 |
| Ours (CW-SFT) | 65 | 46% | 0.91 |
Table 2 — Downstream Accuracy (benchmark.py):
| Method | Tokens | GSM8K | MMLU | TES |
|---|---|---|---|---|
| Original | 120 | 62% | 58% | 0.0052 |
| Verbose (wrapped) | 165 | 58% | 54% | 0.0035 |
| Compressed (ours) | 72 | 61% | 57% | 0.0085 |
Figure 1 — Pareto Frontier (pareto.py): The compression vs accuracy curve.
Table 3 — Ablation (CW-SFT
| Avg Compression | Avg Sim | ||
|---|---|---|---|
| 0 (std SFT) | 0 | 24.1% | 0.48 |
| 1.0 | 0 | TBD | TBD |
| 2.0 | 0 | TBD | TBD |
| 2.0 | 1.0 | TBD | TBD |
| 3.0 | 0 | TBD | TBD |
These are the key terms and tags used throughout the codebase and in paper submissions. Use these when tagging GitHub issues, writing the abstract, or submitting to arXiv.
| Tag | Meaning |
|---|---|
prompt-optimization |
Optimizing LLM input prompts for quality/efficiency |
prompt-compression |
Reducing token count of prompts specifically |
prompt-refactoring |
Restructuring prompts (our framing — most novel) |
efficient-nlp |
Broader NLP efficiency category |
token-efficiency |
Systems that reduce API token usage |
| Tag | Meaning |
|---|---|
qlora |
Quantised Low-Rank Adaptation (our training method) |
lora |
Low-Rank Adaptation (the adapter architecture) |
sft |
Supervised Fine-Tuning (standard imitation learning) |
cw-sft |
Compression-Weighted SFT (our novel contribution) |
best-of-n |
Generating N candidates and selecting the best |
4bit-quantization |
NF4 quantisation via bitsandbytes |
nf4 |
Normal Float 4 — the specific quantisation format |
information-bottleneck |
Theoretical framing for our sampling objective |
dual-objective |
Two competing loss terms (compression + fidelity) |
| Tag | Meaning |
|---|---|
mistral-7b |
The base model used |
causal-lm |
Causal (decoder-only) language model architecture |
instruction-tuning |
Fine-tuning on instruction-following pairs |
| Tag | Meaning |
|---|---|
gsm8k |
Grade School Math 8K — math reasoning benchmark |
mmlu |
Massive Multitask Language Understanding — knowledge benchmark |
bertscore |
(Planned) Token-level semantic similarity metric |
sbert |
Sentence-BERT — our semantic similarity encoder |
tes |
Token Efficiency Score = accuracy / avg_tokens (our metric) |
pareto-frontier |
Compression vs accuracy trade-off curve |
accuracy-retention |
Compressed accuracy / verbose accuracy (target ≥95%) |
perplexity |
Model's fluency score for compressed prompts |
| Tag | Meaning |
|---|---|
gcp |
Google Cloud Platform (T4 GPU VM) |
gradio |
Web UI framework for the demo app |
huggingface |
Model hub + transformers / trl / peft libraries |
trl |
Transformer Reinforcement Learning — SFTTrainer source |
peft |
Parameter-Efficient Fine-Tuning library |
bitsandbytes |
Quantisation library for NF4 |
| Tag | Meaning |
|---|---|
cs.CL |
Computation and Language (primary arXiv category) |
cs.LG |
Machine Learning (secondary) |
cs.AI |
Artificial Intelligence (secondary) |
EMNLP |
Empirical Methods in NLP — target venue |
NeurIPS |
Neural Information Processing Systems — stretch venue |
ACL-Findings |
ACL Findings track — workshop-level contribution |
| Tag | Meaning |
|---|---|
experiment |
A new run or evaluation to conduct |
ablation |
Controlled experiment isolating one variable |
baseline |
Comparison against an existing method |
training |
Changes to training code or config |
evaluation |
Changes to metrics or eval scripts |
dataset |
Changes to seeds, augmentation, or formatting |
paper |
Work directly tied to publication |
# ── Setup ────────────────────────────────────────────────────
source ~/venv/bin/activate && cd ~/Prompt
pip install sentence-transformers llmlingua matplotlib
# ── Dataset ──────────────────────────────────────────────────
python3 scripts/generate_dataset.py --n-augmented 1200
# ── Training: Standard SFT (baseline / ablation) ─────────────
tmux new -s train
python3 scripts/train.py
# ── Training: CW-SFT (compression-weighted, our method) ──────
python3 scripts/train.py --cw-alpha 2.0
# ── Training: Dual-objective (CW-SFT + semantic fidelity) ────
python3 scripts/train.py --cw-alpha 2.0 --sem-beta 1.0
# ── Evaluation: Core 20-prompt eval ──────────────────────────
python3 scripts/evaluate.py
# ── Evaluation: Baseline comparison (Table 1) ────────────────
python3 scripts/baselines.py
# ── Evaluation: Downstream accuracy (Table 2) ────────────────
python3 scripts/benchmark.py --n-samples 200 --add-mmlu
# ── Evaluation: PARETO CURVE (Figure 1 — the paper figure) ──
python3 scripts/pareto.py --task gsm8k --n-samples 100
python3 scripts/pareto.py --task mmlu --n-samples 100
# With LLMLingua sweep (full experiment, ~3hrs):
python3 scripts/pareto.py --task gsm8k --n-samples 100 --no-llmlingua
# ── Full Round 6 pipeline (CW-SFT dual-objective) ───────────
python3 scripts/generate_dataset.py \
&& python3 scripts/train.py --cw-alpha 2.0 --sem-beta 1.0 \
&& python3 scripts/evaluate.py