Fine-tune a local LLM to transform sloppy, verbose user prompts into concise, token-compressed, structured prompts that produce better outputs across any domain.
```
prompt-optimizer/
├── configs/
│   └── default.yaml            # Central configuration
├── src/
│   ├── config.py               # Dataclass-based config + YAML loader
│   ├── utils/                  # Logging setup
│   ├── dataset/
│   │   ├── seeds.py            # High-quality seed examples
│   │   ├── generator.py        # Synthetic dataset pipeline
│   │   └── formatter.py        # Chat-template formatting
│   ├── training/
│   │   └── train.py            # LoRA / QLoRA fine-tuning
│   ├── inference/
│   │   └── engine.py           # Model loading + generation
│   └── evaluation/
│       └── metrics.py          # Token counting + quality heuristics
├── app/
│   └── ui.py                   # Gradio UI
├── scripts/
│   ├── generate_dataset.py     # Dataset generation entrypoint
│   ├── train.py                # Training entrypoint
│   ├── infer.py                # CLI inference entrypoint
│   ├── evaluate.py             # Evaluation entrypoint
│   └── launch_ui.py            # UI launch entrypoint
├── data/                       # Generated datasets
├── outputs/                    # Checkpoints, adapters
├── requirements.txt
├── .gitignore
└── README.md
```
Install the dependencies:

```bash
pip install -r requirements.txt
```

Colab: the same command works in a Colab cell; just add a `!` prefix: `!pip install -r requirements.txt`.
Generate the training data:

```bash
python scripts/generate_dataset.py
```

This creates `data/processed/train.jsonl` and `data/processed/val.jsonl` using built-in seed examples plus synthetic augmentations.
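For a quick sanity check of the generated split, you can read back the first record. This is a minimal sketch assuming the default output path; the field names shown in the comment are illustrative guesses, since the real schema is defined by `src/dataset/generator.py`.

```python
# Inspect the first generated training record (sketch; the example field
# names in the comment are assumptions -- see src/dataset/generator.py for
# the actual schema).
import json

with open("data/processed/train.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(json.dumps(record, indent=2))
# Expected shape (illustrative): a verbose prompt paired with its rewrite,
# e.g. {"input": "Hey, can you maybe ...", "output": "Task: ..."}
```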
To include your own data:
```bash
python scripts/generate_dataset.py --external data/raw/my_data.jsonl
```

Start fine-tuning:

```bash
python scripts/train.py
```

Training uses QLoRA with 4-bit quantisation by default. The LoRA adapter is saved to `outputs/adapter/`.
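Under the hood this is a QLoRA run: the base model is loaded in 4-bit and only a LoRA adapter is trained on top of it. The sketch below shows that setup with the `transformers` and `peft` APIs; the hyperparameter values are placeholders, and the real ones come from `configs/default.yaml`.

```python
# QLoRA setup sketch (values are placeholders; the project reads the real
# ones from configs/default.yaml).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantised base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only the adapter weights train
model.print_trainable_parameters()
```

Because only the adapter parameters are updated, the artefact written to `outputs/adapter/` stays small and is loaded back on top of the base model at inference time.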
To resume from a checkpoint:
```bash
python scripts/train.py --resume outputs/checkpoints/checkpoint-100
```

Run inference on a single prompt:

```bash
python scripts/infer.py "Hey can you please help me write a Python function that ..."
```

Interactive mode:

```bash
python scripts/infer.py --interactive
```

Test samples (no prompt needed):

```bash
python scripts/infer.py
```

Run the evaluation:

```bash
python scripts/evaluate.py
```

This prints a table with token counts, compression ratios, and quality heuristics.
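The real heuristics live in `src/evaluation/metrics.py`. As an illustration of the two headline numbers, token counts and the compression ratio can be computed with the model's tokenizer; the function below is a sketch with made-up names, and the project may define the ratio the other way round.

```python
# Sketch of token counting and compression ratio (illustrative names only;
# the project's actual logic is in src/evaluation/metrics.py).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def compression_ratio(raw_prompt: str, optimized_prompt: str) -> float:
    raw_tokens = len(tokenizer.encode(raw_prompt))
    opt_tokens = len(tokenizer.encode(optimized_prompt))
    return opt_tokens / raw_tokens   # < 1.0 means the optimized prompt is shorter

print(compression_ratio(
    "Hey, could you maybe help me write some Python code that sorts a list?",
    "Write a Python function that sorts a list of integers in ascending order.",
))
```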
Launch the web UI:

```bash
python scripts/launch_ui.py
```

This opens a Gradio app at http://localhost:7860. To create a public link (useful on Colab):

```bash
python scripts/launch_ui.py --share
```
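`app/ui.py` is a Gradio front end over the inference engine. A minimal sketch of that wiring is shown below; the `optimize` function here is a stand-in that does not call the model, and the real UI lives in `app/ui.py`.

```python
# Minimal Gradio wiring sketch (the real UI is app/ui.py; optimize() below
# is a stand-in that would normally call the fine-tuned model).
import gradio as gr

def optimize(raw_prompt: str) -> str:
    # In the real app this calls the inference engine in src/inference/.
    return raw_prompt.strip()

demo = gr.Interface(
    fn=optimize,
    inputs=gr.Textbox(lines=8, label="Verbose prompt"),
    outputs=gr.Textbox(lines=8, label="Optimized prompt"),
    title="Prompt Optimizer",
)
demo.launch(server_port=7860, share=False)  # share=True mirrors the --share flag
```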
All settings live in `configs/default.yaml`. Key sections:
| Section | Controls |
|---|---|
| `model` | Base model name (swap between Mistral / Llama) |
| `quantization` | 4-bit BitsAndBytes settings |
| `lora` | Rank, alpha, dropout, target modules |
| `training` | Epochs, batch size, LR, scheduler, etc. |
| `generation` | `max_new_tokens`, temperature, top_p, etc. |
| `dataset` | Paths, val split ratio, sample limits |
| `evaluation` | Token threshold, min compression ratio |
| `ui` | Server host, port, share flag |
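`src/config.py` is described above as a dataclass-based config with a YAML loader. A minimal sketch of that pattern follows; the field names are illustrative and only two of the sections are spelled out.

```python
# Dataclass + YAML loader sketch in the spirit of src/config.py
# (field names are illustrative, not the project's actual ones).
from dataclasses import dataclass
import yaml

@dataclass
class ModelConfig:
    name: str = "mistralai/Mistral-7B-Instruct-v0.2"

@dataclass
class TrainingConfig:
    num_epochs: int = 3
    per_device_train_batch_size: int = 2
    learning_rate: float = 2e-4

def load_config(path: str = "configs/default.yaml") -> dict:
    with open(path, encoding="utf-8") as f:
        raw = yaml.safe_load(f) or {}
    return {
        "model": ModelConfig(**raw.get("model", {})),
        "training": TrainingConfig(**raw.get("training", {})),
    }
```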
Base model options:

| Model | Min VRAM (4-bit) | Notes |
|---|---|---|
| `mistralai/Mistral-7B-Instruct-v0.2` | ~6 GB | Good default, fast |
| `meta-llama/Meta-Llama-3-8B-Instruct` | ~7 GB | Strong instruction following |
| `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | ~2 GB | Ultra-light for testing |
Change the model in `configs/default.yaml` under `model.name`.
Running on Google Colab:

- Use a T4 GPU runtime (free tier).
- Clone the repo and `pip install -r requirements.txt`.
- Run `scripts/train.py`; the defaults are tuned for low VRAM.
- Use `--share` with the UI to get a public URL.
- If VRAM is tight, reduce `training.max_seq_length` or `training.per_device_train_batch_size` in the config.
The dataset formatter uses each model's native `apply_chat_template()` method, ensuring training data is correctly formatted for Llama-3-Instruct, Mistral-Instruct, and similar chat models. A plain-text fallback is included for tokenizers without a built-in chat template.
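For reference, this is roughly what that step looks like with the `transformers` API; the message roles and the fallback layout below are assumptions about `src/dataset/formatter.py`, not a copy of it.

```python
# Chat-template formatting with a plain-text fallback, in the spirit of
# src/dataset/formatter.py (role/content layout here is an assumption).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def format_example(raw_prompt: str, optimized_prompt: str) -> str:
    messages = [
        {"role": "user", "content": raw_prompt},
        {"role": "assistant", "content": optimized_prompt},
    ]
    if tokenizer.chat_template is not None:
        # Native template: emits the correct special tokens for chat models.
        return tokenizer.apply_chat_template(messages, tokenize=False)
    # Plain-text fallback for tokenizers without a built-in chat template.
    return f"### Input:\n{raw_prompt}\n\n### Output:\n{optimized_prompt}"
```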
Generation is capped via `generation.max_new_tokens` (default 512), which is high enough that optimized prompts are not cut off mid-output. Adjust it in the config if your use case requires longer compressed prompts.
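Putting the pieces together, inference is the base model plus the saved adapter and a capped `generate()` call. The sketch below assumes the default adapter path and placeholder sampling values; the real engine lives in `src/inference/engine.py` and would normally also run the prompt through the chat template first.

```python
# Load the base model, attach the trained adapter, and generate with a
# capped max_new_tokens (paths and sampling values are placeholders; the
# project reads them from configs/default.yaml).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, "outputs/adapter")  # attach LoRA weights

prompt = "Hey can you please help me write a Python function that ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,   # the generation.max_new_tokens cap discussed above
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```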