I'm an ML Engineer who ended up running GPU infrastructure at a scale most ML people only read about. That path wasn't planned — it happened because the problems at the systems layer turned out to be more interesting than I expected.
Right now I work at the edge of HPC and MLOps: I've kept 1,500+ GPU nodes running in production (H100, H200, AMD MI210, DGX SuperPod), handled distributed training verification for DeepSpeed ZeRO and FSDP workloads, and built the tooling teams actually use — not the polished kind, the kind that fixes a broken GPU reporting pipeline at 2am before a client review.
Currently pushing toward full LLMOps: RAG pipelines, model registries, and making inference actually deployable at scale.
| Domain | Tools |
|---|---|
| HPC Scheduling | SLURM, GRES, sacct/sacctmgr, NVIDIA BCM, AWX/Ansible |
| Distributed Training | DeepSpeed ZeRO (1/2/3), FSDP, multi-node GPU setups |
| LLM Inference | Ollama, model serving, API workflows, GPU-aware environment setup |
| Cluster Storage | DDN Lustre, parallel I/O, Singularity, containerized workloads |
| Benchmarking | HPL, RCCL, STREAM, CUDA benchmarks, Intel MPI |
| ML Modeling | Prophet, LSTM, XGBoost, CNNs, SageMaker pipelines |
- RAG pipeline — local LLM inference with retrieval, no cloud API dependency (first sketch below)
- GPU utilization reporter — accurate per-user consumption from SLURM GRES records, generalizing the fix I built for production (second sketch below)
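
The core of the RAG pipeline is a small loop: embed the query, rank local documents by cosine similarity, and feed the top hits as context to a locally served model. A minimal sketch, assuming an Ollama server on its default port (11434) with `nomic-embed-text` and `llama3` pulled locally — the model names, helper functions, and toy corpus here are illustrative, not the project's actual code:

```python
# Minimal local RAG loop against an Ollama server (default port 11434).
# Assumes `ollama pull nomic-embed-text` and `ollama pull llama3` were run;
# swap in whichever models are available locally.
import requests
import numpy as np

OLLAMA = "http://localhost:11434"

def embed(text: str) -> np.ndarray:
    # Ollama's embeddings endpoint returns {"embedding": [floats]}.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return np.array(r.json()["embedding"])

# Toy corpus standing in for a real document store.
docs = [
    "ZeRO stage 3 shards parameters, gradients, and optimizer state.",
    "GRES accounting in SLURM tracks per-job GPU allocations.",
]
doc_vecs = np.stack([embed(d) for d in docs])

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank every stored document by cosine similarity to the query.
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query: str) -> str:
    # Stuff the retrieved context into the prompt; no cloud API involved.
    context = "\n".join(retrieve(query))
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "llama3",
                            "prompt": f"Context:\n{context}\n\nQuestion: {query}",
                            "stream": False})
    r.raise_for_status()
    return r.json()["response"]

print(answer("What does ZeRO stage 3 shard?"))
```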
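
The GPU utilization reporter reduces to one aggregation: join each job's allocated GPU count (from `AllocTRES`) with its elapsed time, then sum per user. A simplified sketch of that idea, assuming standard `sacct` accounting fields — the production tool does more than this:

```python
# Per-user GPU-hours from SLURM accounting. Shells out to sacct and parses
# AllocTRES, where GPU allocations appear as "gres/gpu=N" (typed entries like
# "gres/gpu:a100=N" may also appear; the first match is the untyped total).
import re
import subprocess
from collections import defaultdict

def gpu_hours_by_user(start: str, end: str) -> dict[str, float]:
    out = subprocess.run(
        ["sacct", "-a", "-X", "--noheader", "--parsable2",
         "-S", start, "-E", end,
         "--format=User,ElapsedRaw,AllocTRES"],
        capture_output=True, text=True, check=True,
    ).stdout
    totals: dict[str, float] = defaultdict(float)
    for line in out.splitlines():
        if not line:
            continue
        user, elapsed, tres = line.split("|")
        if not user:            # skip records with no owning user
            continue
        m = re.search(r"gres/gpu[^=]*=(\d+)", tres)
        if m:
            # GPU-hours = allocated GPUs * elapsed seconds / 3600
            totals[user] += int(m.group(1)) * int(elapsed) / 3600.0
    return dict(totals)

if __name__ == "__main__":
    for user, hours in sorted(gpu_hours_by_user("2024-01-01", "2024-02-01").items()):
        print(f"{user:<12} {hours:10.1f} GPU-h")
```

Using `-X` keeps only allocation records (no job steps), which is what prevents the double counting that breaks naive reporting pipelines.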
- Finishing my MS in Electrical Engineering (AI track) at Ain Shams University
- BrightSkies / Core42 (G42) — Senior HPC Systems Engineer, Azure H100/H200/MI210 clusters (Abu Dhabi)
- BrightSkies / SDAIA — Sole technical owner, 60-node DGX H100 SuperPod (Riyadh)
- BrightSkies / KAUST — HPC support + LLM inference tooling for research clusters
- elmenus — Data Scientist, demand forecasting and operational ML
- Omdena — ML Engineer, applied projects in computer vision and time-series forecasting
- BSc Electrical Engineering, Alexandria University — GPA 3.4, Very Good with Honours
Alexandria, Egypt — open to remote and hybrid roles