OpenMed's deidentify() API supports five redaction methods for detected
PII entities:
| Method | Output | Use when |
|---|---|---|
mask |
[NAME], [EMAIL], … |
You want clear placeholders. |
remove |
"" (deleted) |
You don't need positional alignment. |
replace |
Locale-aware fake surrogates | You need realistic-looking text. |
hash |
NAME_a1b2c3d4 (entity-typed digest) |
You need consistent linking across docs. |
shift_dates |
Dates only — shifted by N days | You want to preserve relative time. |
This document focuses on replace, which was upgraded in v1.3.0 to a full
Faker-backed obfuscation engine.
method="replace" no longer picks from a small hardcoded list. It builds
a per-document Anonymizer (see
openmed.core.anonymizer)
that delegates to Faker with custom providers
for clinical IDs.
from openmed import deidentify
deidentify(
"Paciente Pedro Almeida, CPF: 123.456.789-09",
method="replace",
lang="pt",
locale="pt_BR", # default for pt is pt_PT; override per call
consistent=True, # same input -> same surrogate within doc
seed=42, # cross-run reproducibility
)lang (an ISO 639-1 code OpenMed uses everywhere) maps to a Faker
locale via LANG_TO_LOCALE:
OpenMed lang |
Faker locale | Notes |
|---|---|---|
en |
en_US |
|
fr |
fr_FR |
|
de |
de_DE |
|
it |
it_IT |
|
es |
es_ES |
|
nl |
nl_NL |
|
hi |
hi_IN |
|
te |
en_IN |
Faker has no Telugu locale — emits a one-time warning. |
pt |
pt_PT |
Override with locale="pt_BR" for Brazilian Portuguese. |
Pass locale= explicitly to override per call (e.g. pt_BR to generate
CPF/CNPJ surrogates instead of Portuguese NIF/VAT).
Three modes:
- Random (default). Every call samples fresh surrogates. Good for audits or when you want visible variability.
consistent=True. Same(canonical_label, original_value)pair resolves to the same surrogate within the call. "John Doe" appearing twice in one document gets one surrogate.seed=<int>(impliesconsistent=True). Same seed across runs produces the same surrogate stream — useful for snapshot tests and regression fixtures.
Determinism uses hashlib.blake2b over (seed, canonical_label, original),
so different originals always get different surrogates.
Phone numbers, dates, and emails preserve the structure of the original:
deidentify("Call (415) 555-1234", method="replace", consistent=True, seed=1)
# -> "Call (XXX) XXX-XXXX" (digit groups, separators, country code position)
deidentify("Born 01/15/1970", method="replace", consistent=True, seed=1)
# -> "Born MM/DD/YYYY" (separator and ordering kept)
deidentify("Email: john@hospital.org", method="replace", lang="en")
# -> "Email: alice@hospital.org" (domain kept, local part faked)OpenMed reuses the existing checksum validators in
openmed.core.pii_i18n
so every surrogate ID passes the same validator that detection uses:
| Locale | ID type | Provider |
|---|---|---|
pt_BR |
CPF | Faker built-in (pt_BR.cpf) |
pt_BR |
CNPJ | Faker built-in (pt_BR.cnpj) |
nl_NL |
BSN (Elfproef) | Faker built-in (nl_NL.ssn) |
fr_FR |
NIR | Faker built-in (fr_FR.ssn) |
it_IT |
Codice Fiscale | Faker built-in (it_IT.ssn) |
es_ES |
NIE | Faker built-in (es_ES.nie) |
en_IN |
Aadhaar (Verhoeff) | OpenMed AadhaarProvider (Faker's built-in is invalid) |
de_DE |
Steuer-ID | OpenMed GermanSteuerIdProvider (Faker's de_DE.ssn is US-style) |
| any | NPI (Luhn over 80840) | OpenMed NPIProvider |
| any | Generic MRN | OpenMed MedicalRecordNumberProvider |
from openmed import register_clinical_provider, register_label_generator
from faker.providers import BaseProvider
# Add your own checksum-bearing ID format
class HospitalAccountProvider(BaseProvider):
def hospital_account(self):
return f"HACC-{self.numerify('########')}"
register_clinical_provider(HospitalAccountProvider)
# Override the generator for a canonical label
def my_first_name(faker, original, *, locale):
return faker.first_name() + "-test"
register_label_generator("FIRST_NAME", my_first_name)OpenMed ships three privacy-filter families, all the same OpenAI
Privacy Filter architecture (gpt-oss-style sparse-MoE transformer with
local attention, sink tokens, RoPE+YaRN, tiktoken o200k_base), differing
only in their training data:
| Variant | Trained on | PyTorch artifact | MLX (full) | MLX (8-bit) |
|---|---|---|---|---|
| OpenAI Privacy Filter | OpenAI's PII training set | openai/privacy-filter |
OpenMed/privacy-filter-mlx |
OpenMed/privacy-filter-mlx-8bit |
| OpenAI Nemotron Privacy Filter | Nemotron PII dataset | OpenMed/privacy-filter-nemotron |
OpenMed/privacy-filter-nemotron-mlx |
OpenMed/privacy-filter-nemotron-mlx-8bit |
| OpenMed Multilingual Privacy Filter | OpenMed multilingual PII corpus, 16 languages | OpenMed/privacy-filter-multilingual |
OpenMed/privacy-filter-multilingual-mlx |
OpenMed/privacy-filter-multilingual-mlx-8bit |
All run through the same extract_pii() / deidentify() API — only the
weights differ:
extract_pii(text, model_name="OpenMed/privacy-filter-mlx-8bit")
extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
extract_pii(text, model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
deidentify(text, model_name="OpenMed/privacy-filter-nemotron",
method="replace", consistent=True, seed=42)
deidentify(text, model_name="OpenMed/privacy-filter-multilingual",
method="replace", consistent=True, seed=42)Backend selection. On Apple Silicon with MLX importable, the MLX
artifact runs natively via PrivacyFilterMLXPipeline. Elsewhere, the
call substitutes the corresponding PyTorch model via transformers and
emits a one-time UserWarning explaining the swap. The fallback is
family-aware — an MLX-only Nemotron request on Linux substitutes
OpenMed/privacy-filter-nemotron, and an MLX-only multilingual request
substitutes OpenMed/privacy-filter-multilingual, so the user gets the same
training distribution they asked for.
Either way the output entity dicts have the same shape so the rest of the pipeline behaves identically. Smart-merging (regex-based span construction) is skipped for this family — the model already does Viterbi-constrained BIOES decoding internally.
The dispatch lives in
openmed.core.backends.create_privacy_filter_pipeline.
To register a new fine-tune that should fall back to its own PyTorch repo
on non-Mac hosts, add a row to _TORCH_FALLBACK_BY_FAMILY in that module.
If a fine-tune introduces a genuinely different architecture (not just
new weights), it would also need a new MLX model class and family branch
in openmed.mlx.models.build_model — but a same-architecture fine-tune
needs neither.
replace outputs are no longer drawn from the prior hardcoded list
(["Jane Smith", "John Doe", "Alex Johnson", "Sam Taylor"], etc.).
Any test that asserted on those exact strings must either:
- pass
consistent=True, seed=<value>and update the expected output, or - assert non-equality with the original instead of equality with a hardcoded surrogate.
Other methods (mask, remove, hash, shift_dates) are unchanged.