perf: compress embedded espeak-ng-data with rust-embed#62

Closed
pszymkowiak wants to merge 2 commits into main from perf/compress-espeak-data
@pszymkowiak

Summary

Follow-up to #60. The embedded espeak-ng-data directory was shipped uncompressed via include_dir!, inflating the release binary from ~27 MB (pre-fix) to ~50 MB. The bytes are mostly phoneme dictionaries — plain text and small binary tables that compress well.

Swap include_dir = "0.7" for rust-embed = { version = "8", features = ["compression", "interpolate-folder-path"] }. Files are DEFLATE-compressed at build time and decompressed when extracted (once, on first piper call). Expected sizes:

| | Before #60 | After #60 (include_dir) | This PR (rust-embed + deflate) |
|---|---|---|---|
| Linux release binary | ~27 MB | 50 MB | ~32 MB (estimate) |
| Linux release tarball | ~12 MB | 22 MB | ~13 MB (estimate) |

Same staging path ($OUT_DIR/espeak-ng-data populated by build.rs), same first-run extraction to ~/.config/vox/piper/espeak-ng-data/, same PIPER_ESPEAKNG_DATA_DIRECTORY env-var dance. No user-facing change.

Test plan

Swap include_dir for rust-embed with the 'compression' feature. The
embedded files were uncompressed phoneme dictionaries (~15 MB of plain
text and binary tables); DEFLATE typically gives 4-5x on this content.
Expected release binary drop: ~50 MB to ~32 MB, tarball ~22 MB to ~13 MB.

UX unchanged: first piper run still extracts to
~/.config/vox/piper/espeak-ng-data and sets PIPER_ESPEAKNG_DATA_DIRECTORY.
The sentinel file logic short-circuits on subsequent runs.
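
For readers unfamiliar with the pattern, the first-run extraction with a sentinel short-circuit described above might look roughly like the sketch below. This is not the project's actual code: the sentinel name `.extracted`, the function `extract_once`, and the `(relative path, bytes)` input shape are all hypothetical stand-ins for whatever the embed crate yields.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical sentinel file name; the real name is not shown in this PR.
const SENTINEL: &str = ".extracted";

/// Writes each `(relative path, contents)` pair under `target`, unless the
/// sentinel already exists. Returns `Ok(true)` if extraction ran and
/// `Ok(false)` if it was skipped.
fn extract_once(target: &Path, files: &[(&str, &[u8])]) -> io::Result<bool> {
    let sentinel = target.join(SENTINEL);
    if sentinel.exists() {
        // A previous run completed; nothing to do.
        return Ok(false);
    }
    fs::create_dir_all(target)?;
    for (rel, bytes) in files {
        let dest = target.join(rel);
        if let Some(parent) = dest.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(&dest, bytes)?;
    }
    // Written last, so an interrupted extraction is retried on the next run.
    fs::write(&sentinel, b"ok")?;
    Ok(true)
}
```

After a successful call, the caller would presumably point PIPER_ESPEAKNG_DATA_DIRECTORY at `target`. Writing the sentinel last is a design choice in this sketch, not something the PR confirms.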
@pszymkowiak

Closing — rust-embed v8's compression feature requires a path relative to CARGO_MANIFEST_DIR, which rules out our $OUT_DIR-staged espeak-ng-data. The clean alternative (manual tar+gzip in build.rs + decompress at runtime) doubles the moving parts of the embed for a ~20 MB binary gain.

Per the analysis in #59, a 50 MB Rust ML CLI binary is within the 2026 norm (ruff ~30 MB, bun ~60 MB, deno ~85 MB) and the curl-install UX is unaffected. Keeping the simpler include_dir setup from #60 and revisiting only if binary size becomes a concrete pain point.

@pszymkowiak pszymkowiak deleted the perf/compress-espeak-data branch May 14, 2026 21:05