🛡️ SecLM — Fine-Tuned LLM for Log Analysis & Threat Detection

LoRA fine-tuned Qwen3-8B for security log analysis, threat detection, anomaly classification, and incident triage.

SecLM transforms a general-purpose LLM into a cybersecurity analyst that understands syslog, firewall logs, cloud audit trails, WAF events, EDR telemetry, and SIEM alerts — and can classify threats, explain attack patterns, and recommend response actions.

🎯 What SecLM Does

Capability	Description
Log Classification	Classifies raw log entries as benign, suspicious, or malicious
Threat Detection	Identifies attack patterns — brute force, lateral movement, C2 callbacks, data exfiltration, privilege escalation
MITRE ATT&CK Mapping	Maps detected threats to MITRE ATT&CK tactics, techniques, and sub-techniques
Anomaly Explanation	Explains why a log entry or sequence is anomalous in plain language
Incident Triage	Assigns severity (Critical / High / Medium / Low / Info) with reasoning
IOC Extraction	Extracts indicators of compromise — IPs, domains, hashes, file paths, user agents
Response Recommendations	Suggests containment and remediation actions based on the detected threat
Multi-Source Analysis	Understands logs from Linux syslog, Windows Event Logs, AWS CloudTrail, firewall/IDS, WAF, Kubernetes audit, and more

🔍 Example

Input:

Analyze this log entry:
Mar 15 03:42:18 web-prod-03 sshd[28451]: Failed password for root from 185.220.101.42 port 44831 ssh2
Mar 15 03:42:19 web-prod-03 sshd[28453]: Failed password for root from 185.220.101.42 port 44832 ssh2
Mar 15 03:42:19 web-prod-03 sshd[28455]: Failed password for admin from 185.220.101.42 port 44833 ssh2
Mar 15 03:42:20 web-prod-03 sshd[28457]: Accepted password for admin from 185.220.101.42 port 44834 ssh2

SecLM Output:

Classification: MALICIOUS
Severity: CRITICAL
Attack Type: SSH Brute Force → Successful Compromise

MITRE ATT&CK:
- Tactic: Initial Access (TA0001)
- Technique: Brute Force (T1110)
- Sub-technique: Password Guessing (T1110.001)

Analysis:
Rapid sequential failed SSH login attempts targeting multiple accounts (root, admin)
from a single source IP 185.220.101.42, followed by a successful authentication
for the 'admin' account. The 1-second interval between attempts indicates automated
tooling. The source IP is associated with known Tor exit nodes.

IOCs:
- Source IP: 185.220.101.42 (Tor exit node)
- Targeted accounts: root, admin
- Target host: web-prod-03

Recommended Actions:
1. IMMEDIATE: Disable the 'admin' account on web-prod-03
2. IMMEDIATE: Block 185.220.101.42 at firewall level
3. INVESTIGATE: Check for post-compromise activity — new processes, cron jobs,
   SSH keys added, lateral movement attempts
4. HARDEN: Disable password authentication, enforce key-based SSH only
5. MONITOR: Enable enhanced logging on web-prod-03 for the next 72 hours

🏗️ Architecture

┌─────────────────────────────────────────────────────┐
│                    SecLM Pipeline                     │
├─────────────────────────────────────────────────────┤
│                                                      │
│   Raw Logs ──→ Preprocessing ──→ SecLM (Qwen3-8B   │
│   (syslog,      (normalize,       + LoRA adapter)    │
│    CloudTrail,   chunk,                              │
│    WAF, EDR,     enrich)       ──→ Structured Output │
│    k8s audit)                      (classification,  │
│                                     MITRE mapping,   │
│                                     IOCs, actions)   │
│                                                      │
└─────────────────────────────────────────────────────┘

Base Model: Qwen3-8B (Apache 2.0) Fine-Tuning Method: LoRA (rank 16–64) / QLoRA for budget GPUs Training Data: Security log datasets with expert-labeled threat classifications

📋 Supported Log Sources

Source	Log Types
Linux	syslog, auth.log, kern.log, journald
Windows	Security Event Log (4624, 4625, 4688, 4720, etc.), PowerShell logs
Cloud — AWS	CloudTrail, VPC Flow Logs, GuardDuty findings, WAF logs
Cloud — Azure	Activity Log, NSG Flow Logs, Entra ID sign-in logs
Cloud — GCP	Cloud Audit Logs, VPC Flow Logs
Network	Firewall logs (iptables, pf, Palo Alto, Fortinet), IDS/IPS (Snort, Suricata)
Web	Apache/Nginx access logs, WAF logs (ModSecurity, AWS WAF, Cloudflare)
Kubernetes	Audit logs, Falco alerts, OPA/Gatekeeper denials
Endpoint	EDR telemetry (CrowdStrike, SentinelOne, Wazuh)
SIEM	Pre-parsed alerts from Splunk, Elastic SIEM, QRadar

🔥 Threat Categories Covered

Category	Examples
Brute Force & Credential Attacks	SSH brute force, password spraying, credential stuffing
Lateral Movement	Pass-the-hash, RDP pivoting, SMB relay
Privilege Escalation	sudo abuse, token manipulation, kernel exploits
Command & Control	Beacon callbacks, DNS tunneling, encoded PowerShell
Data Exfiltration	Unusual outbound transfers, DNS exfil, cloud storage abuse
Web Attacks	SQLi, XSS, RFI/LFI, path traversal, SSRF
Persistence	Cron jobs, systemd services, registry run keys, SSH authorized_keys
Reconnaissance	Port scanning, service enumeration, directory brute forcing
Cloud-Specific	IAM key abuse, S3 bucket exposure, security group modifications
Container/K8s	Privileged container escape, RBAC bypass, crypto mining pods

📦 Dataset Format

Training data follows a structured instruction format with security-specific fields:

[
  {
    "instruction": "You are SecLM, a cybersecurity log analysis expert. Analyze the following log entry and provide: classification (BENIGN/SUSPICIOUS/MALICIOUS), severity, attack type, MITRE ATT&CK mapping, analysis, IOCs, and recommended actions.",
    "input": "Mar 15 03:42:18 web-prod-03 sshd[28451]: Failed password for root from 185.220.101.42 port 44831 ssh2\nMar 15 03:42:19 web-prod-03 sshd[28453]: Failed password for root from 185.220.101.42 port 44832 ssh2",
    "output": "Classification: MALICIOUS\nSeverity: HIGH\nAttack Type: SSH Brute Force\n\nMITRE ATT&CK:\n- Tactic: Initial Access (TA0001)\n- Technique: Brute Force (T1110.001)\n\nAnalysis: Rapid repeated SSH authentication failures targeting root account from a single source IP with sequential ports indicates automated brute force attack.\n\nIOCs:\n- Source IP: 185.220.101.42\n- Target: root@web-prod-03\n\nRecommended Actions:\n1. Block source IP at firewall\n2. Verify root login is disabled in sshd_config\n3. Check if any attempt succeeded"
  }
]

Dataset Sources for Training

Source	Type	Description
CICIDS2017	Network IDS	Labeled network traffic with multiple attack types
CSE-CIC-IDS2018	Network IDS	Updated version with more attack scenarios
LANL Unified Host and Network	Auth + Network	58 days of auth events from Los Alamos National Lab
Splunk Boss of the SOC (BOTS)	SIEM	Realistic SOC scenario datasets
Custom labeled logs	Various	Your own labeled production logs (recommended)
Synthetic generation	Various	LLM-generated log scenarios for data augmentation

📋 Requirements

Component	Minimum (QLoRA)	Recommended (LoRA)
GPU VRAM	10 GB	24 GB
System RAM	16 GB	32 GB
Disk	30 GB	50 GB
Python	3.10	3.11
CUDA	12.1	12.4

Tested GPUs: NVIDIA A10 (24GB), RTX 4090, RTX 3090, A100

🚀 Quick Start

# Clone
git clone https://github.com/<your-username>/seclm-log-threat-detection.git
cd seclm-log-threat-detection

# Install
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # optional, recommended

# Test with example security logs
python train.py

# Train with your labeled log dataset
python train.py --dataset data/your_labeled_logs.json --epochs 3

# QLoRA mode (budget GPUs — RTX 3090, T4)
python train.py --dataset data/your_labeled_logs.json --use_4bit

# Higher capacity for complex threat detection
python train.py --dataset data/your_labeled_logs.json --lora_rank 64 --max_seq_length 2048

⚙️ Recommended Training Configs

Use Case	GPU	Command
Quick test	Any 24GB	`python train.py`
Production (A10)	A10 24GB	`python train.py --dataset data/logs.json --lora_rank 32 --epochs 3`
Budget training	T4 16GB	`python train.py --dataset data/logs.json --use_4bit --batch_size 1 --max_seq_length 512`
Max quality	A100 80GB	`python train.py --dataset data/logs.json --lora_rank 64 --batch_size 4 --max_seq_length 2048`

💾 VRAM Usage

Config	VRAM	Notes
LoRA fp16, rank 16	~18-20 GB	Default, good for most log analysis tasks
LoRA fp16, rank 64	~20-22 GB	Better for complex multi-source correlation
QLoRA 4-bit, rank 16	~10-12 GB	Budget option, slight quality tradeoff
QLoRA 4-bit, rank 64	~12-14 GB	Best quality on budget hardware

🎯 Inference & Deployment

Load Fine-Tuned Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", device_map="auto", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "./output/<run_name>/final")
tokenizer = AutoTokenizer.from_pretrained("./output/<run_name>/final")

Merge for Production Deployment

merged = model.merge_and_unload()
merged.save_pretrained("./seclm-production")
tokenizer.save_pretrained("./seclm-production")
# Deploy with vLLM, TGI, or Ollama — no PEFT dependency needed

Integration with SIEM / SOC Pipeline

┌──────────┐     ┌──────────────┐     ┌───────────┐     ┌──────────────┐
│  Log      │────→│  Preprocessing│────→│  SecLM    │────→│  Alert /     │
│  Sources  │     │  & Batching   │     │  Inference│     │  SOAR Action │
│           │     │              │     │  (vLLM)   │     │              │
│ • Syslog  │     │ • Normalize  │     │           │     │ • PagerDuty  │
│ • Cloud   │     │ • Chunk      │     │ Returns:  │     │ • Slack      │
│ • WAF     │     │ • Filter     │     │ • Class   │     │ • JIRA       │
│ • EDR     │     │   noise      │     │ • MITRE   │     │ • Block IP   │
│ • K8s     │     │              │     │ • IOCs    │     │ • Isolate    │
└──────────┘     └──────────────┘     └───────────┘     └──────────────┘

📁 Project Structure

seclm-log-threat-detection/
├── train.py                     # Main fine-tuning script
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── LICENSE                      # Apache 2.0
├── .gitignore                   # Git ignore rules
├── data/
│   ├── example_security_logs.json  # Example labeled security logs
│   └── README.md                   # Dataset preparation guide
├── configs/
│   ├── lora_a10.yaml               # Config for A10 GPU
│   ├── qlora_t4.yaml               # Config for T4 GPU (budget)
│   └── lora_a100.yaml              # Config for A100 GPU
└── output/                         # Training outputs (git-ignored)

💰 Cost Estimates

Platform	GPU	$/hr	5K log samples	50K log samples
Vast.ai	RTX 4090	$0.25-0.40	~$1-2	~$5-10
Vast.ai	A100 80GB	$0.75-1.35	~$1-3	~$8-15
AWS	g5.xlarge (A10G)	$1.00-1.20	~$3-5	~$15-25
RunPod	RTX 4090	$0.35-0.44	~$1-3	~$6-12

🗺️ Roadmap

🤝 Contributing

Contributions welcome! High-impact areas:

Labeled datasets — share anonymized, labeled log samples
Log parsers — add parsers for new log sources
Evaluation — build benchmarks for detection accuracy
Integration — connectors for SIEM platforms
Documentation — dataset preparation guides, deployment tutorials

⚠️ Disclaimer

SecLM is a research and analysis tool. It is not a replacement for production security monitoring systems, SIEM platforms, or professional SOC analysts. Always validate findings with your security team before taking action. The model may produce false positives or miss threats. Use as a supplementary analysis layer, not as a sole detection mechanism.

📄 License

Apache 2.0 — free for commercial and non-commercial use.

🙏 Acknowledgments

Qwen Team (Alibaba) — Qwen3-8B base model
Hugging Face — Transformers, PEFT, TRL
MITRE ATT&CK — Threat classification framework
Canadian Institute for Cybersecurity — Public IDS datasets

If this helps your security operations, give it a ⭐!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🛡️ SecLM — Fine-Tuned LLM for Log Analysis & Threat Detection

🎯 What SecLM Does

🔍 Example

🏗️ Architecture

📋 Supported Log Sources

🔥 Threat Categories Covered

📦 Dataset Format

Dataset Sources for Training

📋 Requirements

🚀 Quick Start

⚙️ Recommended Training Configs

💾 VRAM Usage

🎯 Inference & Deployment

Load Fine-Tuned Model

Merge for Production Deployment

Integration with SIEM / SOC Pipeline

📁 Project Structure

💰 Cost Estimates

🗺️ Roadmap

🤝 Contributing

⚠️ Disclaimer

📄 License

🙏 Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

🛡️ SecLM — Fine-Tuned LLM for Log Analysis & Threat Detection

🎯 What SecLM Does

🔍 Example

🏗️ Architecture

📋 Supported Log Sources

🔥 Threat Categories Covered

📦 Dataset Format

Dataset Sources for Training

📋 Requirements

🚀 Quick Start

⚙️ Recommended Training Configs

💾 VRAM Usage

🎯 Inference & Deployment

Load Fine-Tuned Model

Merge for Production Deployment

Integration with SIEM / SOC Pipeline

📁 Project Structure

💰 Cost Estimates

🗺️ Roadmap

🤝 Contributing

⚠️ Disclaimer

📄 License

🙏 Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages