PII Anonymization

OpenMed's deidentify() API supports five redaction methods for detected PII entities:

Method	Output	Use when
`mask`	`[NAME]`, `[EMAIL]`, …	You want clear placeholders.
`remove`	`""` (deleted)	You don't need positional alignment.
`replace`	Locale-aware fake surrogates	You need realistic-looking text.
`hash`	`NAME_a1b2c3d4` (entity-typed digest)	You need consistent linking across docs.
`shift_dates`	Dates only — shifted by N days	You want to preserve relative time.

This document focuses on replace, which was upgraded in v1.3.0 to a full Faker-backed obfuscation engine.

The new `replace` engine

method="replace" no longer picks from a small hardcoded list. It builds a per-document Anonymizer (see openmed.core.anonymizer) that delegates to Faker with custom providers for clinical IDs.

from openmed import deidentify

deidentify(
    "Paciente Pedro Almeida, CPF: 123.456.789-09",
    method="replace",
    lang="pt",
    locale="pt_BR",          # default for pt is pt_PT; override per call
    consistent=True,         # same input -> same surrogate within doc
    seed=42,                 # cross-run reproducibility
)

Locale resolution

lang (an ISO 639-1 code OpenMed uses everywhere) maps to a Faker locale via LANG_TO_LOCALE:

OpenMed `lang`	Faker locale	Notes
`en`	`en_US`
`fr`	`fr_FR`
`de`	`de_DE`
`it`	`it_IT`
`es`	`es_ES`
`nl`	`nl_NL`
`hi`	`hi_IN`
`te`	`en_IN`	Faker has no Telugu locale — emits a one-time warning.
`pt`	`pt_PT`	Override with `locale="pt_BR"` for Brazilian Portuguese.

Pass locale= explicitly to override per call (e.g. pt_BR to generate CPF/CNPJ surrogates instead of Portuguese NIF/VAT).

Determinism

Three modes:

Random (default). Every call samples fresh surrogates. Good for audits or when you want visible variability.
consistent=True. Same (canonical_label, original_value) pair resolves to the same surrogate within the call. "John Doe" appearing twice in one document gets one surrogate.
seed=<int> (implies consistent=True). Same seed across runs produces the same surrogate stream — useful for snapshot tests and regression fixtures.

Determinism uses hashlib.blake2b over (seed, canonical_label, original), so different originals always get different surrogates.

Format preservation

Phone numbers, dates, and emails preserve the structure of the original:

deidentify("Call (415) 555-1234", method="replace", consistent=True, seed=1)
#  -> "Call (XXX) XXX-XXXX"  (digit groups, separators, country code position)

deidentify("Born 01/15/1970", method="replace", consistent=True, seed=1)
#  -> "Born MM/DD/YYYY"      (separator and ordering kept)

deidentify("Email: john@hospital.org", method="replace", lang="en")
#  -> "Email: alice@hospital.org"  (domain kept, local part faked)

Clinical ID checksums

OpenMed reuses the existing checksum validators in openmed.core.pii_i18n so every surrogate ID passes the same validator that detection uses:

Locale	ID type	Provider
`pt_BR`	CPF	Faker built-in (`pt_BR.cpf`)
`pt_BR`	CNPJ	Faker built-in (`pt_BR.cnpj`)
`nl_NL`	BSN (Elfproef)	Faker built-in (`nl_NL.ssn`)
`fr_FR`	NIR	Faker built-in (`fr_FR.ssn`)
`it_IT`	Codice Fiscale	Faker built-in (`it_IT.ssn`)
`es_ES`	NIE	Faker built-in (`es_ES.nie`)
`en_IN`	Aadhaar (Verhoeff)	OpenMed `AadhaarProvider` (Faker's built-in is invalid)
`de_DE`	Steuer-ID	OpenMed `GermanSteuerIdProvider` (Faker's `de_DE.ssn` is US-style)
any	NPI (Luhn over 80840)	OpenMed `NPIProvider`
any	Generic MRN	OpenMed `MedicalRecordNumberProvider`

Extending

from openmed import register_clinical_provider, register_label_generator
from faker.providers import BaseProvider

# Add your own checksum-bearing ID format
class HospitalAccountProvider(BaseProvider):
    def hospital_account(self):
        return f"HACC-{self.numerify('########')}"

register_clinical_provider(HospitalAccountProvider)

# Override the generator for a canonical label
def my_first_name(faker, original, *, locale):
    return faker.first_name() + "-test"

register_label_generator("FIRST_NAME", my_first_name)

Privacy-filter family

OpenMed ships three privacy-filter families, all the same OpenAI Privacy Filter architecture (gpt-oss-style sparse-MoE transformer with local attention, sink tokens, RoPE+YaRN, tiktoken o200k_base), differing only in their training data:

Variant	Trained on	PyTorch artifact	MLX (full)	MLX (8-bit)
OpenAI Privacy Filter	OpenAI's PII training set	`openai/privacy-filter`	`OpenMed/privacy-filter-mlx`	`OpenMed/privacy-filter-mlx-8bit`
OpenAI Nemotron Privacy Filter	Nemotron PII dataset	`OpenMed/privacy-filter-nemotron`	`OpenMed/privacy-filter-nemotron-mlx`	`OpenMed/privacy-filter-nemotron-mlx-8bit`
OpenMed Multilingual Privacy Filter	OpenMed multilingual PII corpus, 16 languages	`OpenMed/privacy-filter-multilingual`	`OpenMed/privacy-filter-multilingual-mlx`	`OpenMed/privacy-filter-multilingual-mlx-8bit`

All run through the same extract_pii() / deidentify() API — only the weights differ:

extract_pii(text, model_name="OpenMed/privacy-filter-mlx-8bit")
extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
extract_pii(text, model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")

deidentify(text, model_name="OpenMed/privacy-filter-nemotron",
           method="replace", consistent=True, seed=42)
deidentify(text, model_name="OpenMed/privacy-filter-multilingual",
           method="replace", consistent=True, seed=42)

Backend selection. On Apple Silicon with MLX importable, the MLX artifact runs natively via PrivacyFilterMLXPipeline. Elsewhere, the call substitutes the corresponding PyTorch model via transformers and emits a one-time UserWarning explaining the swap. The fallback is family-aware — an MLX-only Nemotron request on Linux substitutes OpenMed/privacy-filter-nemotron, and an MLX-only multilingual request substitutes OpenMed/privacy-filter-multilingual, so the user gets the same training distribution they asked for.

Either way the output entity dicts have the same shape so the rest of the pipeline behaves identically. Smart-merging (regex-based span construction) is skipped for this family — the model already does Viterbi-constrained BIOES decoding internally.

The dispatch lives in openmed.core.backends.create_privacy_filter_pipeline. To register a new fine-tune that should fall back to its own PyTorch repo on non-Mac hosts, add a row to _TORCH_FALLBACK_BY_FAMILY in that module. If a fine-tune introduces a genuinely different architecture (not just new weights), it would also need a new MLX model class and family branch in openmed.mlx.models.build_model — but a same-architecture fine-tune needs neither.

Backwards compatibility

replace outputs are no longer drawn from the prior hardcoded list (["Jane Smith", "John Doe", "Alex Johnson", "Sam Taylor"], etc.). Any test that asserted on those exact strings must either:

pass consistent=True, seed=<value> and update the expected output, or
assert non-equality with the original instead of equality with a hardcoded surrogate.

Other methods (mask, remove, hash, shift_dates) are unchanged.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PII Anonymization

The new `replace` engine

Locale resolution

Determinism

Format preservation

Clinical ID checksums

Extending

Privacy-filter family

Backwards compatibility

FilesExpand file tree

anonymization.md

Latest commit

History

anonymization.md

File metadata and controls

PII Anonymization

The new replace engine

Locale resolution

Determinism

Format preservation

Clinical ID checksums

Extending

Privacy-filter family

Backwards compatibility

The new `replace` engine