Date: 2026-01-31
batchalign align can consume ~3–4+ GB per worker on this machine, and scaling worker count multiplies that footprint. The dominant costs are full‑file audio tensors plus large model weights and intermediate inference buffers (forced alignment and UTR). This is partly unavoidable given current model architectures, but there are practical changes that can reduce peak memory or improve sharing.
- Per‑worker RSS peaks observed: ~3.0–4.2 GB.
- With 10 workers on a 34 GB system, available memory dropped below 8 GB while new workers were still being scheduled.
- Low‑memory warnings appear while workers are still active, consistent with a crash when aggregate RSS exceeds RAM.
- `WhisperFAModel`, `Wave2VecFAModel`, and `WhisperASRModel` call `torchaudio.load()` and keep a full‑length tensor in memory.
- Slices are used for alignment, but the full tensor remains resident per worker.
- Each worker is a separate process; models are loaded inside each worker.
- There is no cross‑process model sharing, so weights are duplicated N times.
- Whisper FA uses `output_attentions=True` and concatenates cross‑attention tensors, which can be large.
- Wave2Vec FA computes emission matrices over segments; still heavy for long spans.
- Whisper UTR loads a full ASR model and processes full audio if utterance timings are missing.
- Rev UTR avoids local model compute, but still processes full CHAT + alignment.
- `BatchalignPipeline` deep‑copies the document before processing; this duplicates large transcripts in memory.
- mp4→wav conversion previously re‑ran every time; now skipped when a same‑basename `.wav` exists.
- Within a worker: yes. `_worker_pipeline` is cached globally per process; engines keep their models once loaded.
- Across workers: no. Each process loads its own model copy. This is the primary source of multiplicative memory usage.
- Result caching: alignment/UTR results are cached, but this does not reduce model memory.
- Model weights (Whisper/Wav2Vec) + large inference buffers are inherently large.
- Full audio tensor loads unless we switch to streaming.
- Process duplication: sharing models across workers or reducing worker count.
- Full‑file audio loads: stream or map audio to avoid full‑tensor residency.
- UTR work: skip when utterance timings already exist.
- Whisper FA attention buffers: optional alternative or chunking to reduce attention tensor sizes.
- Adaptive worker cap based on observed RSS peaks (see proposal).
- Default `align` worker cap (e.g., `min(cpu_count, 6)`).
- Per‑run limit flag in docs (recommend <=6 on 34 GB systems).
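A minimal sketch of how such a cap could be computed. The 4.2 GB per‑worker peak and 4 GB headroom are placeholder constants taken from the observations above, not tuned values:

```python
import os


def adaptive_worker_cap(available_bytes,
                        per_worker_peak=int(4.2 * 1024**3),  # observed upper RSS peak
                        headroom=4 * 1024**3,                # reserve for OS/other processes
                        default_cap=6,
                        cpu_count=None):
    """Cap workers by CPU count, a static default, and the memory budget."""
    budget = max(0, available_bytes - headroom)
    by_memory = budget // per_worker_peak
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cpus, default_cap, by_memory))
```

On a 34 GB machine this yields 6 workers; under heavy memory pressure it degrades to 1 rather than over‑committing.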
- Threaded inference + shared models
- Use a smaller number of model processes (1–2) with task queues.
- Avoids N copies of weights.
- Model server process
- Central worker owns model; children send audio segments for alignment.
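A sketch of the model‑server idea with stdlib multiprocessing queues. The `model` stand‑in is a placeholder; a real server would load the FA model once at that point. The `fork` context keeps the example short (see the platform table below for fork caveats):

```python
import multiprocessing as mp


def _serve(requests, results):
    """Server process: owns the single model copy."""
    model = lambda segment: "aligned:" + segment  # stand-in for a real FA model
    while True:
        job = requests.get()
        if job is None:          # shutdown sentinel
            break
        job_id, segment = job
        results.put((job_id, model(segment)))


def align_all(segments):
    """Clients enqueue segments; one server process does all model compute."""
    ctx = mp.get_context("fork")
    requests, results = ctx.Queue(), ctx.Queue()
    server = ctx.Process(target=_serve, args=(requests, results))
    server.start()
    for i, seg in enumerate(segments):
        requests.put((i, seg))
    out = dict(results.get() for _ in segments)
    requests.put(None)           # stop the server
    server.join()
    return out
```

The request queue also gives natural backpressure: clients block when the server falls behind instead of piling up resident tensors.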
- Stream audio: load only needed segments from disk rather than full tensor.
- Shorter FA segments: configurable max segment length reduces emission/attention size.
- Lower precision / quantization where supported.
- Explicit GC between files: release tensors and call `torch.cuda.empty_cache()` for GPU builds.
- Skip UTR when utterance timings already exist (documented flag or auto‑detect).
- Prefer Wave2Vec FA over Whisper FA for memory reasons.
- Already addressed: skip mp4 conversion if a `.wav` exists.
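To illustrate the streaming‑audio mitigation: `torchaudio.load()` accepts `frame_offset` and `num_frames`, so the FA engines could read only the span they align. As a dependency‑free sketch of the same pattern, here is a windowed read with the stdlib `wave` module:

```python
import wave


def read_segment(path, start_s, dur_s):
    """Read only the needed span of a wav file; the full file never
    becomes resident. torchaudio.load(path, frame_offset=..., num_frames=...)
    offers the equivalent windowed read for tensors."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        w.setpos(int(start_s * rate))           # seek to the segment start
        return w.readframes(int(dur_s * rate))  # read only dur_s seconds
```

Per‑worker residency then scales with the longest segment rather than the longest file.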
- Implement adaptive worker cap (proposed separately) to prevent OOM without sacrificing throughput.
- Add optional streaming audio path for FA engines to avoid full‑file tensors.
- Add a “model‑shared” mode (single model process + job queue) for memory‑constrained machines.
- Extend memlog to record per‑file audio duration and model type to better correlate with RSS peaks.
- For 34 GB systems: use `--workers 5` or `--workers 6`.
- Pre‑convert mp4 to wav (or keep wavs beside mp4s) to avoid pre‑run conversion overhead.
- Use Rev UTR (default) and Wave2Vec FA (default) to minimize memory.
- Pros: no code changes; users can scale up or down; isolates crashes to a subset of files.
- Cons: completely uncoordinated memory usage; easy to exceed RAM; no shared logging, no backpressure, no centralized progress; duplicated model loads per process; difficult for non-experts to tune.
- Net: better than strict sequential, but unreliable and user-hostile.
- Pros: centralized scheduling; adaptive cap prevents most OOM; mem-guard can fail fast; single CLI for users; predictable output; easier telemetry.
- Cons: each worker still loads its own models; peak RSS scales with workers; still vulnerable to very large inputs if caps are misestimated.
- Net: best short-term balance of throughput and safety on a single server.
- Pros: would reduce memory by sharing read-only weights.
- Cons: unsafe with MPS; fragile across platforms; still duplicates audio tensors; high crash risk on macOS.
- Net: not viable on this hardware.
Shared-model prefork is only viable when the underlying device supports fork safely and model weights stay read-only.
| Platform / Device | Status | Likely viable tools | Constraints |
|---|---|---|---|
| macOS + MPS | Not safe | None | Fork + MPS is unsafe; child processes can crash (observed). |
| macOS + CPU-only | Possible | Whisper FA/UTR, Wave2Vec FA, Stanza | Must disable MPS; performance slower but memory sharing works. |
| Linux + CUDA | Possible (with caveats) | Whisper FA/UTR, Wave2Vec FA, Pyannote | CUDA supports fork, but ensure models are initialized before forking; avoid lazy CUDA init in workers. |
| Linux + CPU-only | Likely safe | All CPU-only pipelines | Most consistent environment for prefork sharing. |
| Windows | Not viable | None | Uses spawn, no fork-based sharing. |
Notes:
- Whisper / Wave2Vec: heavy models benefit most from sharing. Prefer prefork only when the device is CPU or CUDA and `fork` is supported.
- Stanza morphotag: CPU-bound and large, but may still benefit from prefork on Linux/CPU.
- Rev.ai UTR: remote service; sharing models is irrelevant.
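The table and notes above reduce to a small guard that could run before choosing prefork mode. The function name is illustrative:

```python
import multiprocessing as mp


def prefork_sharing_supported(device):
    """Encode the platform table: no sharing with MPS or on spawn-only
    platforms; CPU and CUDA are viable with care."""
    if device == "mps":
        return False                   # fork + MPS crashes child processes
    if "fork" not in mp.get_all_start_methods():
        return False                   # e.g., Windows supports spawn only
    return device in ("cpu", "cuda")   # CUDA caveat: init models before forking
```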
Introduce a CLI flag (e.g., --force-cpu or --no-mps) to disable MPS so macOS users can opt into prefork shared-model mode. This keeps the default safe (MPS on), but offers a user-controlled pathway when memory pressure is more important than speed.
`batchalign --force-cpu --shared-models align <in> <out>`
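A sketch of how the proposed flag could override device selection. Both flag names come from the proposal above and are not yet implemented; `pick_device` is a hypothetical helper:

```python
import argparse


def pick_device(argv, mps_available):
    """--force-cpu (proposed) overrides automatic device selection so
    prefork shared-model mode is safe on macOS."""
    parser = argparse.ArgumentParser(prog="batchalign")
    parser.add_argument("--force-cpu", action="store_true",
                        help="disable MPS; slower, but enables model sharing")
    parser.add_argument("--shared-models", action="store_true")
    args, _rest = parser.parse_known_args(argv)  # pass through subcommand args
    if args.force_cpu:
        return "cpu"
    return "mps" if mps_available else "cpu"
```

The default stays safe (MPS on when available); only an explicit opt‑in changes behavior.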
- Pros: true single-copy model weights; explicit backpressure; can batch for throughput; avoids fork/MPS issues.
- Cons: higher implementation complexity; new IPC bottlenecks; server crash can stall all work; serialization overhead.
- Net: best long-term reliability + memory profile, but slower to deliver.
Stay with built-in multiprocessing and adaptive caps/mem-guard. It replaces the manual multi-process hack with coordinated scheduling and safety controls without requiring a complex refactor.
We now cache the median worker RSS peaks and file-size ratios per command so adaptive caps can start with a better estimate before any workers finish. This improves the initial cap decision and reduces early over-commit on cold starts.
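A sketch of that warm‑start cache. The JSON shape (`{command: {"peaks": [...], "median_peak": ...}}`) and function names are illustrative, not the actual memlog format:

```python
import json
import statistics
from pathlib import Path


def record_peaks(cache_path, command, rss_peaks):
    """Fold this run's worker RSS peaks (GB) into the per-command cache."""
    data = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    entry = data.setdefault(command, {"peaks": []})
    entry["peaks"].extend(rss_peaks)
    entry["median_peak"] = statistics.median(entry["peaks"])
    cache_path.write_text(json.dumps(data))


def warm_start_estimate(cache_path, command, default):
    """Seed the adaptive cap with the cached median before any worker finishes."""
    data = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    return data.get(command, {}).get("median_peak", default)
```

Cold starts fall back to `default` (e.g., the observed 4.2 GB upper bound) until at least one run has been recorded.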