Skip to content

manupanand-freelance-developer/seclm-log-threat-detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

2 Commits
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿ›ก๏ธ SecLM โ€” Fine-Tuned LLM for Log Analysis & Threat Detection

LoRA fine-tuned Qwen3-8B for security log analysis, threat detection, anomaly classification, and incident triage.

SecLM transforms a general-purpose LLM into a cybersecurity analyst that understands syslog, firewall logs, cloud audit trails, WAF events, EDR telemetry, and SIEM alerts โ€” and can classify threats, explain attack patterns, and recommend response actions.

License: Apache 2.0 Python 3.10+ PyTorch 2.2+ Qwen3-8B


๐ŸŽฏ What SecLM Does

Capability Description
Log Classification Classifies raw log entries as benign, suspicious, or malicious
Threat Detection Identifies attack patterns โ€” brute force, lateral movement, C2 callbacks, data exfiltration, privilege escalation
MITRE ATT&CK Mapping Maps detected threats to MITRE ATT&CK tactics, techniques, and sub-techniques
Anomaly Explanation Explains why a log entry or sequence is anomalous in plain language
Incident Triage Assigns severity (Critical / High / Medium / Low / Info) with reasoning
IOC Extraction Extracts indicators of compromise โ€” IPs, domains, hashes, file paths, user agents
Response Recommendations Suggests containment and remediation actions based on the detected threat
Multi-Source Analysis Understands logs from Linux syslog, Windows Event Logs, AWS CloudTrail, firewall/IDS, WAF, Kubernetes audit, and more

๐Ÿ” Example

Input:

Analyze this log entry:
Mar 15 03:42:18 web-prod-03 sshd[28451]: Failed password for root from 185.220.101.42 port 44831 ssh2
Mar 15 03:42:19 web-prod-03 sshd[28453]: Failed password for root from 185.220.101.42 port 44832 ssh2
Mar 15 03:42:19 web-prod-03 sshd[28455]: Failed password for admin from 185.220.101.42 port 44833 ssh2
Mar 15 03:42:20 web-prod-03 sshd[28457]: Accepted password for admin from 185.220.101.42 port 44834 ssh2

SecLM Output:

Classification: MALICIOUS
Severity: CRITICAL
Attack Type: SSH Brute Force โ†’ Successful Compromise

MITRE ATT&CK:
- Tactic: Initial Access (TA0001)
- Technique: Brute Force (T1110)
- Sub-technique: Password Guessing (T1110.001)

Analysis:
Rapid sequential failed SSH login attempts targeting multiple accounts (root, admin)
from a single source IP 185.220.101.42, followed by a successful authentication
for the 'admin' account. The 1-second interval between attempts indicates automated
tooling. The source IP is associated with known Tor exit nodes.

IOCs:
- Source IP: 185.220.101.42 (Tor exit node)
- Targeted accounts: root, admin
- Target host: web-prod-03

Recommended Actions:
1. IMMEDIATE: Disable the 'admin' account on web-prod-03
2. IMMEDIATE: Block 185.220.101.42 at firewall level
3. INVESTIGATE: Check for post-compromise activity โ€” new processes, cron jobs,
   SSH keys added, lateral movement attempts
4. HARDEN: Disable password authentication, enforce key-based SSH only
5. MONITOR: Enable enhanced logging on web-prod-03 for the next 72 hours

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    SecLM Pipeline                     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                      โ”‚
โ”‚   Raw Logs โ”€โ”€โ†’ Preprocessing โ”€โ”€โ†’ SecLM (Qwen3-8B   โ”‚
โ”‚   (syslog,      (normalize,       + LoRA adapter)    โ”‚
โ”‚    CloudTrail,   chunk,                              โ”‚
โ”‚    WAF, EDR,     enrich)       โ”€โ”€โ†’ Structured Output โ”‚
โ”‚    k8s audit)                      (classification,  โ”‚
โ”‚                                     MITRE mapping,   โ”‚
โ”‚                                     IOCs, actions)   โ”‚
โ”‚                                                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Base Model: Qwen3-8B (Apache 2.0) Fine-Tuning Method: LoRA (rank 16โ€“64) / QLoRA for budget GPUs Training Data: Security log datasets with expert-labeled threat classifications

๐Ÿ“‹ Supported Log Sources

Source Log Types
Linux syslog, auth.log, kern.log, journald
Windows Security Event Log (4624, 4625, 4688, 4720, etc.), PowerShell logs
Cloud โ€” AWS CloudTrail, VPC Flow Logs, GuardDuty findings, WAF logs
Cloud โ€” Azure Activity Log, NSG Flow Logs, Entra ID sign-in logs
Cloud โ€” GCP Cloud Audit Logs, VPC Flow Logs
Network Firewall logs (iptables, pf, Palo Alto, Fortinet), IDS/IPS (Snort, Suricata)
Web Apache/Nginx access logs, WAF logs (ModSecurity, AWS WAF, Cloudflare)
Kubernetes Audit logs, Falco alerts, OPA/Gatekeeper denials
Endpoint EDR telemetry (CrowdStrike, SentinelOne, Wazuh)
SIEM Pre-parsed alerts from Splunk, Elastic SIEM, QRadar

๐Ÿ”ฅ Threat Categories Covered

Category Examples
Brute Force & Credential Attacks SSH brute force, password spraying, credential stuffing
Lateral Movement Pass-the-hash, RDP pivoting, SMB relay
Privilege Escalation sudo abuse, token manipulation, kernel exploits
Command & Control Beacon callbacks, DNS tunneling, encoded PowerShell
Data Exfiltration Unusual outbound transfers, DNS exfil, cloud storage abuse
Web Attacks SQLi, XSS, RFI/LFI, path traversal, SSRF
Persistence Cron jobs, systemd services, registry run keys, SSH authorized_keys
Reconnaissance Port scanning, service enumeration, directory brute forcing
Cloud-Specific IAM key abuse, S3 bucket exposure, security group modifications
Container/K8s Privileged container escape, RBAC bypass, crypto mining pods

๐Ÿ“ฆ Dataset Format

Training data follows a structured instruction format with security-specific fields:

[
  {
    "instruction": "You are SecLM, a cybersecurity log analysis expert. Analyze the following log entry and provide: classification (BENIGN/SUSPICIOUS/MALICIOUS), severity, attack type, MITRE ATT&CK mapping, analysis, IOCs, and recommended actions.",
    "input": "Mar 15 03:42:18 web-prod-03 sshd[28451]: Failed password for root from 185.220.101.42 port 44831 ssh2\nMar 15 03:42:19 web-prod-03 sshd[28453]: Failed password for root from 185.220.101.42 port 44832 ssh2",
    "output": "Classification: MALICIOUS\nSeverity: HIGH\nAttack Type: SSH Brute Force\n\nMITRE ATT&CK:\n- Tactic: Initial Access (TA0001)\n- Technique: Brute Force (T1110.001)\n\nAnalysis: Rapid repeated SSH authentication failures targeting root account from a single source IP with sequential ports indicates automated brute force attack.\n\nIOCs:\n- Source IP: 185.220.101.42\n- Target: root@web-prod-03\n\nRecommended Actions:\n1. Block source IP at firewall\n2. Verify root login is disabled in sshd_config\n3. Check if any attempt succeeded"
  }
]

Dataset Sources for Training

Source Type Description
CICIDS2017 Network IDS Labeled network traffic with multiple attack types
CSE-CIC-IDS2018 Network IDS Updated version with more attack scenarios
LANL Unified Host and Network Auth + Network 58 days of auth events from Los Alamos National Lab
Splunk Boss of the SOC (BOTS) SIEM Realistic SOC scenario datasets
Custom labeled logs Various Your own labeled production logs (recommended)
Synthetic generation Various LLM-generated log scenarios for data augmentation

๐Ÿ“‹ Requirements

Component Minimum (QLoRA) Recommended (LoRA)
GPU VRAM 10 GB 24 GB
System RAM 16 GB 32 GB
Disk 30 GB 50 GB
Python 3.10 3.11
CUDA 12.1 12.4

Tested GPUs: NVIDIA A10 (24GB), RTX 4090, RTX 3090, A100

๐Ÿš€ Quick Start

# Clone
git clone https://github.com/<your-username>/seclm-log-threat-detection.git
cd seclm-log-threat-detection

# Install
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # optional, recommended

# Test with example security logs
python train.py

# Train with your labeled log dataset
python train.py --dataset data/your_labeled_logs.json --epochs 3

# QLoRA mode (budget GPUs โ€” RTX 3090, T4)
python train.py --dataset data/your_labeled_logs.json --use_4bit

# Higher capacity for complex threat detection
python train.py --dataset data/your_labeled_logs.json --lora_rank 64 --max_seq_length 2048

โš™๏ธ Recommended Training Configs

Use Case GPU Command
Quick test Any 24GB python train.py
Production (A10) A10 24GB python train.py --dataset data/logs.json --lora_rank 32 --epochs 3
Budget training T4 16GB python train.py --dataset data/logs.json --use_4bit --batch_size 1 --max_seq_length 512
Max quality A100 80GB python train.py --dataset data/logs.json --lora_rank 64 --batch_size 4 --max_seq_length 2048

๐Ÿ’พ VRAM Usage

Config VRAM Notes
LoRA fp16, rank 16 ~18-20 GB Default, good for most log analysis tasks
LoRA fp16, rank 64 ~20-22 GB Better for complex multi-source correlation
QLoRA 4-bit, rank 16 ~10-12 GB Budget option, slight quality tradeoff
QLoRA 4-bit, rank 64 ~12-14 GB Best quality on budget hardware

๐ŸŽฏ Inference & Deployment

Load Fine-Tuned Model

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", device_map="auto", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "./output/<run_name>/final")
tokenizer = AutoTokenizer.from_pretrained("./output/<run_name>/final")

Merge for Production Deployment

merged = model.merge_and_unload()
merged.save_pretrained("./seclm-production")
tokenizer.save_pretrained("./seclm-production")
# Deploy with vLLM, TGI, or Ollama โ€” no PEFT dependency needed

Integration with SIEM / SOC Pipeline

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Log      โ”‚โ”€โ”€โ”€โ”€โ†’โ”‚  Preprocessingโ”‚โ”€โ”€โ”€โ”€โ†’โ”‚  SecLM    โ”‚โ”€โ”€โ”€โ”€โ†’โ”‚  Alert /     โ”‚
โ”‚  Sources  โ”‚     โ”‚  & Batching   โ”‚     โ”‚  Inferenceโ”‚     โ”‚  SOAR Action โ”‚
โ”‚           โ”‚     โ”‚              โ”‚     โ”‚  (vLLM)   โ”‚     โ”‚              โ”‚
โ”‚ โ€ข Syslog  โ”‚     โ”‚ โ€ข Normalize  โ”‚     โ”‚           โ”‚     โ”‚ โ€ข PagerDuty  โ”‚
โ”‚ โ€ข Cloud   โ”‚     โ”‚ โ€ข Chunk      โ”‚     โ”‚ Returns:  โ”‚     โ”‚ โ€ข Slack      โ”‚
โ”‚ โ€ข WAF     โ”‚     โ”‚ โ€ข Filter     โ”‚     โ”‚ โ€ข Class   โ”‚     โ”‚ โ€ข JIRA       โ”‚
โ”‚ โ€ข EDR     โ”‚     โ”‚   noise      โ”‚     โ”‚ โ€ข MITRE   โ”‚     โ”‚ โ€ข Block IP   โ”‚
โ”‚ โ€ข K8s     โ”‚     โ”‚              โ”‚     โ”‚ โ€ข IOCs    โ”‚     โ”‚ โ€ข Isolate    โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

๐Ÿ“ Project Structure

seclm-log-threat-detection/
โ”œโ”€โ”€ train.py                     # Main fine-tuning script
โ”œโ”€โ”€ requirements.txt             # Python dependencies
โ”œโ”€โ”€ README.md                    # This file
โ”œโ”€โ”€ LICENSE                      # Apache 2.0
โ”œโ”€โ”€ .gitignore                   # Git ignore rules
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ example_security_logs.json  # Example labeled security logs
โ”‚   โ””โ”€โ”€ README.md                   # Dataset preparation guide
โ”œโ”€โ”€ configs/
โ”‚   โ”œโ”€โ”€ lora_a10.yaml               # Config for A10 GPU
โ”‚   โ”œโ”€โ”€ qlora_t4.yaml               # Config for T4 GPU (budget)
โ”‚   โ””โ”€โ”€ lora_a100.yaml              # Config for A100 GPU
โ””โ”€โ”€ output/                         # Training outputs (git-ignored)

๐Ÿ’ฐ Cost Estimates

Platform GPU $/hr 5K log samples 50K log samples
Vast.ai RTX 4090 $0.25-0.40 ~$1-2 ~$5-10
Vast.ai A100 80GB $0.75-1.35 ~$1-3 ~$8-15
AWS g5.xlarge (A10G) $1.00-1.20 ~$3-5 ~$15-25
RunPod RTX 4090 $0.35-0.44 ~$1-3 ~$6-12

๐Ÿ—บ๏ธ Roadmap

  • LoRA / QLoRA fine-tuning on Qwen3-8B
  • Multi-source log format support
  • MITRE ATT&CK mapping
  • Evaluation benchmarks (detection accuracy, false positive rate)
  • DPO alignment for reducing false positives
  • Multi-GPU training support (FSDP / DeepSpeed)
  • Streaming inference pipeline for real-time log analysis
  • Pre-built dataset generation scripts (synthetic + public datasets)
  • GGUF export for on-prem deployment via llama.cpp
  • vLLM inference server with REST API
  • Splunk / Elastic SIEM plugin
  • Grafana dashboard for detection metrics

๐Ÿค Contributing

Contributions welcome! High-impact areas:

  • Labeled datasets โ€” share anonymized, labeled log samples
  • Log parsers โ€” add parsers for new log sources
  • Evaluation โ€” build benchmarks for detection accuracy
  • Integration โ€” connectors for SIEM platforms
  • Documentation โ€” dataset preparation guides, deployment tutorials

โš ๏ธ Disclaimer

SecLM is a research and analysis tool. It is not a replacement for production security monitoring systems, SIEM platforms, or professional SOC analysts. Always validate findings with your security team before taking action. The model may produce false positives or miss threats. Use as a supplementary analysis layer, not as a sole detection mechanism.

๐Ÿ“„ License

Apache 2.0 โ€” free for commercial and non-commercial use.

๐Ÿ™ Acknowledgments


If this helps your security operations, give it a โญ!

About

Fine-tuned Qwen3-8B for cybersecurity log analysis and threat detection. Classifies security events, maps to MITRE ATT&CK, extracts IOCs, and recommends response actions. LoRA/QLoRA training on a single A10 GPU. Supports syslog, CloudTrail, WAF, EDR, Kubernetes audit, and SIEM logs. Apache 2.0.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

โšก