Skip to content

Add recompile-free extraction via persisted module attributes#437

Open
oliver-kriska wants to merge 4 commits into
elixir-gettext:mainfrom
oliver-kriska:extract-from-attributes
Open

Add recompile-free extraction via persisted module attributes#437
oliver-kriska wants to merge 4 commits into
elixir-gettext:mainfrom
oliver-kriska:extract-from-attributes

Conversation

@oliver-kriska

@oliver-kriska oliver-kriska commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Claude Code generated text:

Motivation

mix gettext.extract discovers messages by force-recompiling the whole project, because extraction happens during gettext macro expansion and the results only live in the extractor agent for that single compilation. On large
projects this dominates extraction time, and because --check-up-to-date runs in CI on every change, the cost is paid constantly.

This implements the recompile-free design @maennchen proposed in #373 (#373 (comment)). That issue was closed once the immediate duplicate-reference / module redefinition bug was fixed in v1.0.2, but the extraction redesign itself was never implemented.

Approach

Two halves:

  1. Write path — during normal compilation, when Mix.env() is in the new:extraction_environments config (default [:dev]), each gettext macro persists its message into an accumulated, persist: true module attribute
    @__gettext_messages__ on the calling module. Backends additionally persist a @__gettext_backend_module__ marker.
  2. Read pathmix gettext.extract --from-attributes runs a normal incremental compile, then reads those attributes back from the compiled BEAM files with :beam_lib.chunks(beam, [:attributes]) (without loading the modules), feeds them into the existing Gettext.ExtractorAgent, and reuses the unchanged Gettext.Extractor.pot_files/2 merge logic.

The force-recompile path is untouched and remains the default.

Why it stays correct

  • Staleness needs no bookkeeping. The BEAM files are the cache: a changed source file is recompiled by the incremental compile and gets fresh attributes; a deleted file's BEAM is pruned by the compiler. No manifest, no mtime/hash logic. --check-up-to-date composes with --from-attributes and gets the same speedup.
  • Releases are unaffected. Environments outside :extraction_environments (notably :prod) persist nothing — no attributes, no marker. Unit-tested (a fixture compiled in :test carries no gettext attributes on any BEAM).
  • No silent data loss. If the scan finds zero messages and zero backends (e.g. the first run after upgrading, with stale attribute-less BEAMs), the task raises with instructions to mix compile --force rather than silently pruning every autogenerated message or silently falling back to a surprise full recompile.
  • Equivalence. Old-path and new-path POT output is byte-identical; covered by tests and validated against a large production app.

Benchmarks

Large production Phoenix app (1,875 files, 11 domains × 16 locales, ~4,700 msgids; Elixir 1.20 / OTP 29):

force-recompile (current) --from-attributes
gettext.extract --merge 49.7s 7.0s
gettext.extract --check-up-to-date (CI) ~60s 4.6s

In that app's CI the extraction step dropped from ~66–94s to ~4–11s and has run on every merge for several days with identical POT output. Dev BEAM size grew ~2.7% (≈ +2.3 MiB for ~4,700 msgids); prod is unaffected.

Scope / limitations

  • Opt-in. Default behavior is unchanged; the fast path is active only when you pass --from-attributes and set :extraction_environments.
  • Modern Gettext.Macros backends only. The deprecated use Gettext, otp_app: ... backends generate their own *_noop macros and don't persist attributes; the zero-found error catches a legacy-only project that opts in.
  • extraction_apps (cross-app dependency extraction) deferred. gettext.extract is @recursive, so each umbrella child already scans its own compile path (covered by a fixture test); extracting messages that live
    in dependencies is new functionality for a follow-up.

Design notes for review

  • Persisted attributes + :beam_lib, rather than the @before_compile __translations__/0 function from the original sketch: same data, no function injected into user modules, and reading attributes doesn't load the modules
    (so scanning 1,000+ BEAMs stays cheap and side-effect-free). Happy to switch to the generated-function approach if you prefer it.
  • The new path feeds the existing ExtractorAgent rather than refactoring pot_files/2, so duplicate-message merging, plural-mismatch warnings and comment de-duplication are reused verbatim — equivalence is structural, not re-implemented.
  • Happy to split this into smaller commits (attribute writing / BEAM reading / task wiring) if that is easier to review.

Tests

New test/mix/tasks/gettext.extract_from_attributes_test.exs covers: old-path / new-path byte-equivalence (fresh + idempotent), macro-generated messages (gettext calls injected into modules that don't textually contain them),
_with_backend variants, reference merging across modules sharing a msgid, --from-attributes --merge, custom :priv, --check-up-to-date staleness via the incremental compile, umbrella extraction, and prod-clean BEAMs.

@josevalim

Copy link
Copy Markdown
Contributor

Do we even need the module attribute? Maybe instead we could simply start the Agent on demand during compilation with the existing mode? Then later we can swap it to use ETS or whatever.

@oliver-kriska

Copy link
Copy Markdown
Contributor Author

but how will we persist cross compilation? So it would mean we will have to still do force compile which can be pretty expensive. So basically your suggested solution is how it works now if I'm not wrong, with that add_message function.

@josevalim

Copy link
Copy Markdown
Contributor

Oh, sorry, I have completely misread it. I see.

The reason I don't like the module attributes is that, IIRC, they are loaded once the code is loaded, and that's also a runtime cost. Another option would be to track additional files per module? And then we prune those files when assembing the .po if they don't have an equivalent .beam.

@maennchen

Copy link
Copy Markdown
Member

@josevalim What do you think about just disabling those attributes for prod? For dev, the runtime cost should be negligible.

I like the approach with module attributes because you have the same incremental compilation as normal and you can extract at any point by enumerating the modules / their attributes. You could for example extract while there's a running phoenix dev server.

@oliver-kriska

Copy link
Copy Markdown
Contributor Author

it should be already there, see those lines

defp extraction_environment? do
    cond do
      not Code.ensure_loaded?(Mix) ->
        false

      is_nil(Mix.Project.get()) ->
        false

      true ->
        gettext_config = Mix.Project.config()[:gettext] || []

        extraction_environments =
          gettext_config[:extraction_environments] ||
            Application.get_env(:gettext, :extraction_environments) ||
            [:dev]

        Mix.env() in extraction_environments
    end
  end

https://github.com/elixir-gettext/gettext/pull/437/changes#diff-b815910ebc5507c67ea8edc182bd8917455114497fbbf5427cc26255b3a30d6bR135-R140

Comment thread lib/mix/tasks/gettext.extract.ex Outdated
Mix.Task.reenable("compile.elixir")
Mix.Task.reenable("compile.app")
Mix.Task.run("compile", [])
Mix.Task.run("compile.elixir", [])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need all of this. We can compile and assume that already puts everything in place? Or it may be better if we use separate tasks maybe? mix gettext.generate or something?

@oliver-kriska oliver-kriska Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, trying something with mix gettext.extract_from_attributes it's too long but .generate is "too wide meaning". Open for other ideas

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to maintain two extraction features in parallel? I trust that this approach will be stable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d maintain both for now. This package is likely a decade old, this is an essential part, id not necessarily assume we won’t have blockers and regressions. We can merge them together later.

@josevalim

Copy link
Copy Markdown
Contributor

Ok, agreed then. I would probably make it like this though:

# config/dev.exs
config :my_app, MyGettext.Backend, automatic_extraction: true

So we are not accessing Mix from deep within Gettext.

It may also be worth treating extraction with and without this feature as completely different things? They can share some steps (such as the merging and fusing) but it may be clearer to treat them as separate code paths execution wise.

oliver-kriska added a commit to oliver-kriska/gettext that referenced this pull request Jun 16, 2026
…utes task

Following Jose's review on PR elixir-gettext#437, split the attribute-based extraction out
of the --from-attributes flag into a dedicated mix gettext.extract_from_attributes
task. The two extraction modes are now separate code paths that share only the
POT merge/fuse and output steps via Mix.Tasks.Gettext.Extract.process/3.

Also drop the incremental_compile/0 reenable dance. The new task does a plain
mix compile (a no-op when the project is already compiled), since by the time it
reads the persisted attributes the project is already compiled and up to date.
@oliver-kriska oliver-kriska force-pushed the extract-from-attributes branch from d27c40c to abd4bf7 Compare June 16, 2026 12:58
mix gettext.extract can only discover messages by force-recompiling the
whole project, because extraction happens during macro expansion and the
results only exist in the extractor agent for the duration of that
compilation. On large projects this dominates extraction time.

Implement the design proposed by @maennchen in elixir-gettext#373: during normal
compilation, when the current Mix environment is included in the new
:extraction_environments gettext configuration (default [:dev]), the
gettext macros persist each message into an accumulated, persisted
@__gettext_messages__ attribute of the calling module, and backends
persist a @__gettext_backend_module__ marker. A new experimental
--from-attributes flag on mix gettext.extract then runs a normal
incremental compile, reads the persisted entries back from the compiled
BEAM files with :beam_lib (without loading modules), feeds them into the
existing extractor agent, and reuses the unchanged
Gettext.Extractor.pot_files/2 merge logic.

Staleness needs no bookkeeping: changed files are recompiled by the
incremental compile and get fresh attributes, and deleted files' beams
are pruned by the compiler. --check-up-to-date composes with the new
flag and gets the same speedup. Environments not listed in
:extraction_environments (such as :prod) persist nothing, so release
artifacts are unaffected. If the scan finds no persisted messages or
backends at all, the task raises with instructions instead of silently
producing empty output.

The force-recompile path is untouched and remains the default. Old-path
and new-path POT output is byte-identical, covered by tests including
macro-generated messages (gettext calls injected into modules that do
not literally mention them) and plural/context/comment/noop variants.
Address review findings on the --from-attributes prototype:

  * Validate the shape of persisted message entries when scanning BEAM
    files, skipping malformed ones instead of crashing the scan with a
    CaseClauseError (an unrelated module could carry an attribute with
    the same name, and future format changes need a safe path).

  * Reenable "compile" and friends in the incremental compile, so that
    extraction still picks up source changes when the compile tasks
    already ran in the current VM (for example, through a task alias).

  * Guard the backend marker registration with Module.has_attribute?/2,
    matching persist_message/6.

Test improvements: assert fixed content on the rebuilt POT (not only
equality with the recompilation path), reset the extractor agent's full
initial state between tests, make reference line assertions robust to
fixture edits, pin the stale-POT filename in the --check-up-to-date
assertion, and restore Code.compiler_options/1 after the suite.

New coverage: _with_backend macro variants, reference merging when two
modules share a msgid, --from-attributes combined with --merge, custom
:priv backends, and extraction across umbrella apps (each app gets its
own POT through the task's recursion).
oliver-kriska added a commit to oliver-kriska/gettext that referenced this pull request Jun 16, 2026
…utes task

Following Jose's review on PR elixir-gettext#437, split the attribute-based extraction out
of the --from-attributes flag into a dedicated mix gettext.extract_from_attributes
task. The two extraction modes are now separate code paths that share only the
POT merge/fuse and output steps via Mix.Tasks.Gettext.Extract.process/3.

Also drop the incremental_compile/0 reenable dance. The new task does a plain
mix compile (a no-op when the project is already compiled), since by the time it
reads the persisted attributes the project is already compiled and up to date.
@oliver-kriska oliver-kriska force-pushed the extract-from-attributes branch from abd4bf7 to 11ae3cd Compare June 16, 2026 15:33
Replace the Mix-environment gate (:extraction_environments) with a
per-backend application-environment toggle, read without touching Mix:

    config :gettext, MyApp.Gettext, automatic_extraction: true

Reading from the application environment instead of Mix.Project/Mix.env
keeps Mix out of the compile-time macro path, is off by default, and is
off outside of Mix (such as runtime compilation in a release), so prod
beams stay free of the persisted attributes.

The toggle is keyed under the :gettext application (not the backend's own
otp_app) so the persist gate never calls the backend at a gettext call
site, avoiding a compile-time dependency that would recompile every
gettext caller whenever PO files change.

The persisted data, BEAM reading, and POT merging are unchanged, so the
--from-attributes output is byte-identical to the recompilation path.
…utes task

Following Jose's review on PR elixir-gettext#437, split the attribute-based extraction out
of the --from-attributes flag into a dedicated mix gettext.extract_from_attributes
task. The two extraction modes are now separate code paths that share only the
POT merge/fuse and output steps via Mix.Tasks.Gettext.Extract.process/3.

Also drop the incremental_compile/0 reenable dance. The new task does a plain
mix compile (a no-op when the project is already compiled), since by the time it
reads the persisted attributes the project is already compiled and up to date.
@oliver-kriska oliver-kriska force-pushed the extract-from-attributes branch from 11ae3cd to c357fca Compare June 16, 2026 15:46
@oliver-kriska

Copy link
Copy Markdown
Contributor Author

fyi I updated code but I want to properly test it with our big translations, but feedback welcome. Also with this oliver-kriska#2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants