Skip to content

Commit 468ab57

Browse files
authored
Merge pull request #7 from ml-cube/dev-drift-analysis
- New module drift analysis that given a dataset perform drift detection (batch or streaming) to identify different concepts in data - Created general implementation of monitoring algorithm to be used by both sklearn and hugging face implementations
2 parents 7bbbfdc + c69b39d commit 468ab57

42 files changed

Lines changed: 2291 additions & 1034 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/actions/validation/action.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@ runs:
1717

1818
- name: 🦾 💅 🧪 Install and validate extras
1919
run: |
20-
extras=("sklearn" "huggingface")
20+
extras=("sklearn" "huggingface", "polars", "pandas")
2121
2222
for extra in "${extras[@]}"; do
2323
echo "🦾 Installing extra: $extra"

Justfile

Lines changed: 35 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -7,21 +7,34 @@ set quiet
77
default:
88
just --list --unsorted
99

10+
# --------------------------------------------------
11+
# Developer Setup
12+
1013
# Synchronize the environment by installing all the dependencies
1114
dev-sync:
1215
uv sync --cache-dir .uv_cache --all-extras
1316

17+
# Synchronize the environment by installing the specified extra dependency
18+
# Currently used within the CI to install extra dependencies and test them.
1419
dev-sync-extra extra:
1520
uv sync --cache-dir .uv_cache --extra {{extra}}
1621

1722
# Synchronize the environment by installing all the dependencies except the dev ones
1823
prod-sync:
19-
uv sync --cache-dir .uv_cache --all-extras --no-dev
24+
uv sync --cache-dir .uv_cache --all-extras --no-default-groups
25+
26+
# Synchronize the environment by installing the extra dependency
27+
# specified. Doesn't install the dev dependencies.
28+
prod-sync-extra extra:
29+
uv sync --cache-dir .uv_cache --extra {{extra}} --no-default-groups
2030

2131
# Install the pre-commit hooks
2232
install-hooks:
2333
uv run pre-commit install
2434

35+
# --------------------------------------------------
36+
# Validation
37+
2538
# Run ruff formatting
2639
format:
2740
uv run ruff format
@@ -31,9 +44,26 @@ lint:
3144
uv run ruff check --fix
3245
uv run mypy --ignore-missing-imports --install-types --non-interactive --package ml3_drift
3346

47+
48+
# Default value for testWorkers is auto (meaning all workers available)
49+
# If you want to pass a custom value (such as 4): `just testWorkers=4 test`
50+
# We also run ruff on tests files (it's so fast that it's worth it)
51+
52+
# Little caveat: when running tests with only an extra installed, you'd like
53+
# to avoid having docs dependencies installed (since, for instance, a mkdocs plugin
54+
# requires Pandas, which is one of our extra dependencies). This happens by default
55+
# since docs dependencies are not installed as default dependencies by uv (see pyproject.toml).
56+
# They are only installed when building / serving the documentation. However, if you first
57+
# build the documentation, then run the tests, you will have the docs dependencies installed.
58+
# Should not be a practical problem (especially since in CI environments we don't install docs dependencies),
59+
# but it's worth noting.
60+
3461
# Run the tests with pytest
62+
testWorkers := "auto"
3563
test:
36-
uv run pytest --verbose --color=yes -n auto --exitfirst tests
64+
uv run ruff format tests
65+
uv run ruff check tests --fix
66+
uv run pytest --verbose --color=yes -n {{testWorkers}} --exitfirst tests
3767

3868
# Run linters, formatters and tests
3969
validate: format lint test
@@ -43,11 +73,12 @@ validate: format lint test
4373

4474
# Generate the documentation
4575
build-docs:
46-
uv run mkdocs build
76+
# Make sure mkdocs is installed
77+
uv run --group docs mkdocs build
4778

4879
# Serve the documentation locally
4980
serve-docs:
50-
uv run mkdocs serve
81+
uv run --group docs mkdocs serve
5182

5283
# --------------------------------------------------
5384
# Publishing

examples/huggingface/text_embedding_monitoring.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,8 @@
33
from ml3_drift.huggingface.drift_detection_pipeline import (
44
HuggingFaceDriftDetectionPipeline,
55
)
6-
from ml3_drift.huggingface.univariate.ks import KSDriftDetector
6+
from ml3_drift.monitoring.multivariate.bonferroni import BonferroniCorrectionAlgorithm
7+
from ml3_drift.monitoring.univariate.continuous.ks import KSAlgorithm
78
from ml3_drift.callbacks.base import logger_callback
89

910

@@ -37,7 +38,8 @@
3738
# to monitor the drift in the embeddings.
3839

3940
hf_pipe = HuggingFaceDriftDetectionPipeline(
40-
drift_detector=KSDriftDetector(
41+
drift_detector=BonferroniCorrectionAlgorithm(
42+
algorithm=KSAlgorithm(p_value=0.05),
4143
callbacks=[
4244
partial(
4345
logger_callback,

pyproject.toml

Lines changed: 31 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -7,52 +7,73 @@ dynamic = ["version"]
77
license = { text = "Apache-2.0" }
88
readme = "README.md"
99

10-
dependencies = []
10+
dependencies = [
11+
"scipy>=1.15.3",
12+
]
1113

1214
# -------------------------------------------------
1315
# Extra dependencies. This package is designed to be
14-
# used within one extra at a time, hence we check each
15-
# extra separately. Remember to update the list of extras
16-
# in the validation action to ensure tests are run
16+
# used with different libraries, which means that our code
17+
# should work only when not all extras are installed.
18+
# Remember to update the list of extras
19+
# in the validation CICD to ensure tests are run
1720
# for your new extra
1821
[project.optional-dependencies]
1922

2023
sklearn = ["scikit-learn>=1.6.1"]
2124

2225
huggingface = ["scipy>=1.15.2", "transformers[torch]>=4.52.3"]
2326

27+
polars = ["polars>=1.31.0"]
28+
29+
pandas = ["pandas>=2.2.3"]
30+
2431

2532
# -------------------------------------------------
2633

2734
[dependency-groups]
2835
dev = [
2936
"ipykernel>=6.29.5",
3037
"mypy>=1.15.0",
31-
"pillow>=11.2.1", # for image support in tests
3238
"pre-commit>=4.1.0",
3339
"pytest>=8.3.4",
3440
"pytest-xdist>=3.6.1",
3541
"ruff>=0.9.5",
36-
# for docs
42+
# for image support in tests
43+
"pillow>=11.2.1",
44+
]
45+
46+
docs = [
47+
"mkdocs-minify-plugin>=0.7.1",
48+
"mkdocs-glightbox>=0.3.4",
49+
"mkdocs-table-reader-plugin>=2.0.1",
50+
"mkdocs-macros-plugin",
3751
"mkdocs>=1.5.0",
3852
"mkdocs-material>=9.5.0",
3953
"mkdocs-material-extensions>=1.1",
4054
"pygments>=2.14",
4155
"pymdown-extensions>=9.9.1",
4256
"jinja2>=3.0",
4357
"markdown>=3.2",
44-
"mkdocs-minify-plugin>=0.7.1",
45-
"mkdocs-glightbox>=0.3.4",
46-
"mkdocs-table-reader-plugin>=2.0.1",
47-
"mkdocs-macros-plugin",
4858
"openpyxl",
4959
]
5060

5161
# -------------------------------------------------
5262

63+
# Default groups for uv
64+
[tool.uv]
65+
default-groups = ["dev"]
66+
67+
# -------------------------------------------------
68+
5369
[build-system]
5470
requires = ["hatchling"]
5571
build-backend = "hatchling.build"
5672

5773
[tool.hatch.version]
5874
path = "src/ml3_drift/__init__.py"
75+
76+
77+
# Set pytest folder
78+
[tool.pytest.ini_options]
79+
testpaths = ["tests"]

ruff.toml

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -33,9 +33,6 @@ exclude = [
3333
line-length = 88
3434
indent-width = 4
3535

36-
# Assume Python 3.9
37-
target-version = "py39"
38-
3936
[lint]
4037
# Enable Pyflakes (`F`) and a subset of the pycodestyle (`E`) codes by default.
4138
# Unlike Flake8, Ruff doesn't enable pycodestyle warnings (`W`) or
File renamed without changes.
File renamed without changes.
Lines changed: 178 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,178 @@
1+
from abc import ABC, abstractmethod
2+
import numpy as np
3+
from typing import TYPE_CHECKING, Union
4+
from typing_extensions import TypeIs
5+
6+
from ml3_drift.analysis.report import Report
7+
from ml3_drift.monitoring.base import MonitoringAlgorithm
8+
from ml3_drift.monitoring.multivariate.bonferroni import BonferroniCorrectionAlgorithm
9+
from ml3_drift.monitoring.univariate.continuous.ks import KSAlgorithm
10+
from ml3_drift.monitoring.univariate.discrete.chi_square import (
11+
ChiSquareAlgorithm,
12+
)
13+
14+
if TYPE_CHECKING:
15+
import pandas as pd
16+
import polars as pl
17+
18+
POLARS = True
19+
try:
20+
import polars as pl
21+
except ModuleNotFoundError:
22+
POLARS = False
23+
24+
25+
PANDAS = True
26+
try:
27+
import pandas as pd
28+
except ModuleNotFoundError:
29+
PANDAS = False
30+
31+
32+
class DataDriftAnalyzer(ABC):
33+
"""
34+
Analyze a dataset identifying the sequence of distributions due to data drifts.
35+
36+
Parameters
37+
----------
38+
continuous_monitoring_algorithm: MonitoringAlgorithm | None
39+
Algorithm used to monitor continuous data. If None, a default algorithm is used.
40+
categorical_monitoring_algorithm: MonitoringAlgorithm | None
41+
Algorithm used to monitor categorical data. If None, a default algorithm is used.
42+
"""
43+
44+
def __init__(
45+
self,
46+
continuous_monitoring_algorithm: MonitoringAlgorithm | None = None,
47+
categorical_monitoring_algorithm: MonitoringAlgorithm | None = None,
48+
):
49+
# We use default algorithms if None is provided.
50+
if continuous_monitoring_algorithm is None:
51+
continuous_monitoring_algorithm = BonferroniCorrectionAlgorithm(
52+
algorithm=KSAlgorithm(),
53+
)
54+
if categorical_monitoring_algorithm is None:
55+
categorical_monitoring_algorithm = BonferroniCorrectionAlgorithm(
56+
algorithm=ChiSquareAlgorithm(),
57+
)
58+
59+
self.continuous_monitoring_algorithm = continuous_monitoring_algorithm
60+
self.categorical_monitoring_algorithm = categorical_monitoring_algorithm
61+
62+
def _is_list_str(self, columns: list[str] | list[int]) -> TypeIs[list[str]]:
63+
"""Verify if the input variable is a list of str in any element"""
64+
65+
return all(isinstance(elem, str) for elem in columns)
66+
67+
def _to_index(
68+
self,
69+
X: Union[np.ndarray, "pd.DataFrame", "pl.DataFrame"],
70+
columns: list[str] | list[int] | None,
71+
) -> list[int]:
72+
"""Translate the list of columns in list of indices.
73+
74+
If columns is None then all the indexes are returned.
75+
If columns is list[int] then it is directly returned.
76+
If columns is list[str] then the indexes are retrieved from column names,
77+
in this case X must be a DataFrame."""
78+
79+
if columns is None:
80+
return list(range(X.shape[0]))
81+
82+
if self._is_list_str(columns):
83+
if POLARS and isinstance(X, pl.DataFrame):
84+
return [i for (i, c) in enumerate(X.columns) if c in columns]
85+
elif PANDAS and isinstance(X, pd.DataFrame):
86+
return [i for (i, c) in enumerate(X.columns) if c in columns]
87+
else:
88+
raise ValueError(
89+
f"Type not valid, expecting polars DataFrame or pandas DataFrame when columns has string values. Got {type(X)}"
90+
)
91+
return columns
92+
93+
def _to_numpy(
94+
self, X: Union[np.ndarray, "pd.DataFrame", "pl.DataFrame"]
95+
) -> np.ndarray:
96+
"""Transform input data into numpy array"""
97+
98+
if POLARS and isinstance(X, pl.DataFrame):
99+
return X.to_numpy()
100+
elif PANDAS and isinstance(X, pd.DataFrame):
101+
return X.to_numpy()
102+
elif isinstance(X, np.ndarray):
103+
return X
104+
else:
105+
raise ValueError(
106+
f"Type not valid, expecting numpy array, polars DataFrame or pandas DataFrame. Got {type(X)}"
107+
)
108+
109+
@abstractmethod
110+
def _scan_data(
111+
self,
112+
X: np.ndarray,
113+
y: np.ndarray | None,
114+
continuous_columns_ids: list[int],
115+
categorical_columns_ids: list[int],
116+
y_categorical: bool,
117+
) -> Report:
118+
"""Scan the data to identify different data partitions according to monitoring algorithm."""
119+
120+
def analyze(
121+
self,
122+
X: Union[np.ndarray, "pd.DataFrame", "pl.DataFrame"],
123+
y: Union[None, np.ndarray, "pd.DataFrame", "pl.DataFrame"],
124+
continuous_columns: list[str] | list[int] | None,
125+
categorical_columns: list[str] | list[int] | None,
126+
y_categorical: bool,
127+
) -> Report:
128+
"""Analyze the data to split them into different distribution according to drift detectors.
129+
130+
If target is provided then concept drift is used as split criterion, otherwise, it uses input drift.
131+
132+
Parameters
133+
----------
134+
X: input data. Can be numpy array, pandas dataframe or polars dataframe
135+
y: target data. It is optional and can be numpy array, pandas dataframe or polars dataframe
136+
continuous_columns: if not None it is the indices or names of the columns that are continuous
137+
categorical_columns: if not None it is the indices or names of the columns that are categorical
138+
y_categorical: if True, then the target is categorical, otherwise it is considered as continuous
139+
140+
Output
141+
------
142+
Report object containing information about identified data groups
143+
"""
144+
# Shape check
145+
if y is not None and X.shape[0] != y.shape[0]:
146+
raise ValueError(
147+
f"When target y is not None it must have the same rows of input X. Got X: {X.shape} and y: {y.shape}"
148+
)
149+
150+
# Continuous and categorical columns to canonical form
151+
if continuous_columns is not None:
152+
continuous_columns_ids = self._to_index(X, continuous_columns)
153+
else:
154+
continuous_columns_ids = []
155+
156+
if categorical_columns is not None:
157+
categorical_columns_ids = self._to_index(X, categorical_columns)
158+
else:
159+
categorical_columns_ids = []
160+
161+
# Input and target in canonical form
162+
array_X = self._to_numpy(X)
163+
164+
if y is not None:
165+
array_y = self._to_numpy(y)
166+
else:
167+
array_y = None
168+
169+
# Data analysis
170+
report = self._scan_data(
171+
array_X,
172+
array_y,
173+
continuous_columns_ids,
174+
categorical_columns_ids,
175+
y_categorical,
176+
)
177+
178+
return report

0 commit comments

Comments
 (0)