feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI by jyejare · Pull Request #6202 · feast-dev/feast

jyejare · 2026-03-31T10:53:28Z

What this PR does / why we need it:

This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.

What it adds:

Core Monitoring Engine

Hybrid computation engine — SQL push-down on the native OfflineStore as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (same architecture as Feast materialization).
Fully native storage — Monitoring metrics are stored within the configured OfflineStore backend itself (no separate monitoring database). Six static methods on the OfflineStore base class (compute_monitoring_metrics, get_monitoring_max_timestamp, ensure_monitoring_tables, save_monitoring_metrics, query_monitoring_metrics, clear_monitoring_baseline) handle compute and storage.
PyArrow-based metrics computation (MetricsCalculator) — Backend-agnostic statistical computation as fallback, supporting:
- Numeric features: mean, stddev, min/max, percentiles (p50/p75/p90/p95/p99), null rate, histograms
- Categorical features: top-N value counts with other/unique counts
- Automatic feature type classification from Feast's PrimitiveFeastType and ValueType
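
The per-feature statistics listed above can be sketched in plain Python. This is a minimal illustration, not the PR's MetricsCalculator: it swaps PyArrow/NumPy for the standard library so it runs anywhere. The function names, the population standard deviation, the linear-interpolation percentiles, and the equal-width histogram are all assumptions.

```python
import math
import statistics
from collections import Counter
from typing import Any, Dict, Optional, Sequence


def _percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 1]; input is sorted and non-empty."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def numeric_metrics(values: Sequence[Optional[float]], bins: int = 4) -> Dict[str, Any]:
    """Summary statistics for one numeric feature column; None marks a null."""
    total = len(values)
    present = sorted(v for v in values if v is not None)
    result: Dict[str, Any] = {
        "row_count": total,
        "null_rate": (total - len(present)) / total if total else None,
    }
    if not present:
        return result  # empty or all-null column: nothing else to compute
    lo, hi = present[0], present[-1]
    result.update(
        mean=statistics.fmean(present),
        stddev=statistics.pstdev(present),  # population stddev (an assumption)
        min=lo,
        max=hi,
    )
    for name, q in (("p50", 0.50), ("p75", 0.75), ("p90", 0.90), ("p95", 0.95), ("p99", 0.99)):
        result[name] = _percentile(present, q)
    # Equal-width histogram; the maximum value is clamped into the last bin.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in present:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    result["histogram"] = counts
    return result


def categorical_metrics(values: Sequence[Optional[str]], top_n: int = 10) -> Dict[str, Any]:
    """Top-N value counts plus the unique count and the count of all other values."""
    total = len(values)
    counts = Counter(v for v in values if v is not None)
    non_null = sum(counts.values())
    top = counts.most_common(top_n)
    return {
        "row_count": total,
        "null_rate": (total - non_null) / total if total else None,
        "unique_count": len(counts),
        "top_values": top,
        "other_count": non_null - sum(c for _, c in top),
    }
```

Handling the empty and all-null cases up front keeps the percentile and histogram code free of special cases.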

Multi-Backend Support (8 Offline Stores)

All 6 native monitoring methods implemented for each backend with dialect-specific SQL:

| Backend | Compute | Storage | Dialect highlights |
| --- | --- | --- | --- |
| PostgreSQL | SQL push-down | `INSERT ON CONFLICT` | `PERCENTILE_CONT`, `WIDTH_BUCKET` |
| Snowflake | SQL push-down | `MERGE` with `VARIANT` JSON | `APPROX_PERCENTILE`, `WIDTH_BUCKET` |
| BigQuery | SQL push-down | `MERGE` into BQ tables | `APPROX_QUANTILES`, parameterized queries |
| Redshift | SQL push-down | `MERGE` via Data API | `APPROXIMATE PERCENTILE_DISC` |
| Spark | SparkSQL push-down | Parquet tables | `PERCENTILE_APPROX`, `spark.sql()` |
| Oracle | SQL via Ibis | `MERGE FROM DUAL` | `PERCENTILE_CONT WITHIN GROUP` |
| DuckDB | In-memory SQL | Parquet files | `QUANTILE_CONT`, `HISTOGRAM` |
| Dask | PyArrow compute | Parquet files | `pyarrow.compute` + `numpy` |
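
To illustrate what "dialect-specific SQL" means for the percentile functions in the table, here is a small sketch that renders one percentile expression per dialect. The template dictionary, the function name, and the dialect keys are hypothetical and are not taken from the PR. The SQL functions themselves are the ones named in the table. The column is assumed to be an already-quoted, trusted identifier.

```python
# Hypothetical per-dialect templates; {q} is a fraction in [0, 1],
# {pct} is the same value as an integer percentage (used by BigQuery).
PERCENTILE_TEMPLATES = {
    "postgres": "PERCENTILE_CONT({q}) WITHIN GROUP (ORDER BY {col})",
    "snowflake": "APPROX_PERCENTILE({col}, {q})",
    "bigquery": "APPROX_QUANTILES({col}, 100)[OFFSET({pct})]",
    "redshift": "APPROXIMATE PERCENTILE_DISC({q}) WITHIN GROUP (ORDER BY {col})",
    "spark": "PERCENTILE_APPROX({col}, {q})",
    "oracle": "PERCENTILE_CONT({q}) WITHIN GROUP (ORDER BY {col})",
    "duckdb": "QUANTILE_CONT({col}, {q})",
}


def percentile_sql(dialect: str, column: str, q: float) -> str:
    """Render a percentile expression for `column` in the given SQL dialect."""
    if dialect not in PERCENTILE_TEMPLATES:
        raise ValueError(f"unsupported dialect: {dialect}")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    return PERCENTILE_TEMPLATES[dialect].format(col=column, q=q, pct=round(q * 100))
```

Note that several of these functions are approximate, so metrics computed on different backends may differ slightly for the same data.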

Multi-Granularity Time-Series Metrics

5 granularities: daily, weekly, biweekly, monthly, quarterly
Auto-compute mode: Detects latest event timestamp and computes all granularities in one job
Pre-computed metrics stored per date + granularity for fast retrieval
On-demand transient compute: Fresh statistics for arbitrary date ranges (not stored)
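
One way to picture auto-compute mode is as a function from the latest event date to one date window per granularity. The sketch below assumes trailing windows of fixed length (1, 7, 14, 30, and 90 days) ending on the latest event date. The PR description does not state whether windows are trailing or calendar-aligned, so the lengths and names here are purely illustrative.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

# Assumed trailing-window lengths in days (not confirmed by the PR).
GRANULARITY_DAYS = {
    "daily": 1,
    "weekly": 7,
    "biweekly": 14,
    "monthly": 30,
    "quarterly": 90,
}


def granularity_windows(latest_event_date: date) -> Dict[str, Tuple[date, date]]:
    """Return {granularity: (start, end)} with inclusive bounds ending at the latest date."""
    return {
        name: (latest_event_date - timedelta(days=days - 1), latest_event_date)
        for name, days in GRANULARITY_DAYS.items()
    }
```

A single job can then loop over the returned windows and store one metrics row per feature, date, and granularity.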

Batch + Log Data Source Support

Batch source: Reads from the feature view's batch_source via OfflineStore.pull_all_from_table_or_query()
Log source: Reads from feature serving logs via FeatureService.logging_config destination, using __log_timestamp as event timestamp
Feature name normalization: Prefixed log column names (driver_stats__conv_rate) are parsed back to their original feature_view_name + feature_name for storage compatibility and drift detection
data_source_type column (batch / log) differentiates metrics in storage
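
The feature name normalization step can be sketched as follows. This assumes the log column prefix is separated from the feature name by the first double underscore, which matches the driver_stats__conv_rate example but would be ambiguous for a feature view whose own name contains a double underscore. The function name is hypothetical.

```python
from typing import Optional, Tuple

LOG_SEPARATOR = "__"


def split_log_column(column: str) -> Optional[Tuple[str, str]]:
    """Return (feature_view_name, feature_name), or None for non-feature columns."""
    if column.startswith(LOG_SEPARATOR):
        return None  # metadata column such as __log_timestamp
    view, sep, feature = column.partition(LOG_SEPARATOR)
    if not sep or not view or not feature:
        return None  # no prefix present, e.g. an entity key column
    return view, feature
```

Storing the split names rather than the prefixed column name is what lets batch and log metrics for the same feature share a key and be compared directly.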

Orchestration Service (`MonitoringService`)

Ties registry, offline store, calculator, and storage together
Computes and aggregates metrics at feature, feature view, and feature service levels
Cached OfflineStore instance for performance
Unified compute/timestamp methods handling both batch and log paths with SQL push-down + fallback
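
The "SQL push-down + fallback" dispatch can be illustrated with a stub. The class and method names below are hypothetical stand-ins, not Feast's real OfflineStore interface. The sketch also assumes that a backend signals missing native support by raising NotImplementedError.

```python
from statistics import fmean
from typing import Dict, List, Union

Metrics = Dict[str, Union[float, str]]


class OfflineStoreStub:
    """Hypothetical stand-in for an offline store backend."""

    def __init__(self, rows: List[float], native: bool = False) -> None:
        self.rows = rows
        self.native = native

    def compute_metrics_native(self) -> Metrics:
        if not self.native:
            raise NotImplementedError("no SQL push-down for this backend")
        # A real backend would run an aggregate query in the warehouse here.
        return {"mean": fmean(self.rows), "path": "native"}

    def pull_rows(self) -> List[float]:
        # A real backend would pull the feature data into memory here.
        return list(self.rows)


def compute_metrics(store: OfflineStoreStub) -> Metrics:
    """Prefer the native path; fall back to in-process computation."""
    try:
        return store.compute_metrics_native()
    except NotImplementedError:
        rows = store.pull_rows()
        return {"mean": fmean(rows), "path": "python_fallback"}
```

The caller sees the same result shape on either path, which is what allows the service to treat all backends uniformly.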

Shared Utilities (`monitoring_utils.py`)

Centralized table name constants, column lists, PK definitions
monitoring_table_meta(), opt_float(), empty_numeric_metric(), empty_categorical_metric(), normalize_monitoring_row(), build_view_aggregate()
Used by all 8 backends — eliminates duplication and ensures consistency

DQM Job Engine (`DQMJobManager`)

Asynchronous job abstraction for metric computation (compute, baseline, auto_compute)
Job status tracking in feast_monitoring_jobs table
Supports future integration with Ray/Spark job runners
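
As a rough mental model of the job abstraction, the sketch below runs jobs on a thread pool and reports their status by ID. It is not the PR's DQMJobManager: the class name, the status strings, and the in-memory registry (instead of the feast_monitoring_jobs table) are all assumptions. Only the three job types come from the description above.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional, Tuple

JOB_TYPES = {"compute", "baseline", "auto_compute"}


class InMemoryJobManager:
    """Hypothetical async job runner that tracks status per job ID."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Tuple[str, Future]] = {}

    def submit(self, job_type: str, fn: Callable[..., Any], *args: Any) -> str:
        """Queue a job and return its ID immediately (non-blocking)."""
        if job_type not in JOB_TYPES:
            raise ValueError(f"unknown job type: {job_type}")
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = (job_type, self._pool.submit(fn, *args))
        return job_id

    def status(self, job_id: str) -> Dict[str, str]:
        job_type, future = self._jobs[job_id]
        if not future.done():
            state = "running"  # queued and running are not distinguished here
        elif future.exception() is not None:
            state = "failed"
        else:
            state = "success"
        return {"job_id": job_id, "job_type": job_type, "status": state}

    def wait(self, job_id: str, timeout: Optional[float] = None) -> Any:
        """Block until the job finishes; re-raises the job's exception on failure."""
        return self._jobs[job_id][1].result(timeout=timeout)
```

Keeping submission separate from execution is what would let the executor be replaced by a Ray or Spark runner without changing callers.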

REST API (`/monitoring/`)

| Method | Endpoint | Description |
| --- | --- | --- |
| `POST` | `/monitoring/compute` | Submit batch DQM job |
| `POST` | `/monitoring/auto_compute` | Auto-detect dates, all granularities |
| `POST` | `/monitoring/compute/transient` | On-demand compute (not stored) |
| `POST` | `/monitoring/compute/log` | Compute from serving logs |
| `POST` | `/monitoring/auto_compute/log` | Auto-detect log dates, all granularities |
| `GET` | `/monitoring/jobs/{job_id}` | DQM job status |
| `GET` | `/monitoring/metrics/features` | Per-feature metrics |
| `GET` | `/monitoring/metrics/feature_views` | Per-view aggregates |
| `GET` | `/monitoring/metrics/feature_services` | Per-service aggregates |
| `GET` | `/monitoring/metrics/baseline` | Baseline distribution retrieval |
| `GET` | `/monitoring/metrics/timeseries` | Time-series data for trend analysis |

All endpoints support cascading filters: project, feature_service_name, feature_view_name, feature_name, granularity, data_source_type, date range.

RBAC enforced using existing AuthzedAction.DESCRIBE (read) and AuthzedAction.UPDATE (compute).
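
A client might assemble a filtered GET request along these lines. The helper name, host, and port are hypothetical. The sketch also assumes that the filters listed above are passed as URL query parameters, which the description does not state explicitly.

```python
from typing import Optional
from urllib.parse import urlencode


def monitoring_url(base_url: str, path: str, **filters: Optional[str]) -> str:
    """Build a /monitoring/ URL, dropping filters that were left as None."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = f"{base_url.rstrip('/')}/monitoring/{path.lstrip('/')}"
    return f"{url}?{query}" if query else url


# Example: per-feature daily batch metrics for one feature view.
example = monitoring_url(
    "http://localhost:6572",  # hypothetical server address
    "metrics/features",
    project="my_project",
    feature_view_name="driver_stats",
    granularity="daily",
    data_source_type="batch",
)
```

Omitting narrower filters such as feature_name widens the result to the whole feature view, which is how the cascading behaviour would look from the client side.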

CLI (`feast monitor run`)

Options:
  --feature-view TEXT     Feature view name (omit for all)
  --feature-name TEXT     Feature name(s), repeatable
  --start-date TEXT       Start date YYYY-MM-DD (omit for auto-detect)
  --end-date TEXT         End date YYYY-MM-DD (omit for auto-detect)
  --granularity TEXT      daily | weekly | biweekly | monthly | quarterly
  --set-baseline          Mark this computation as baseline
  --source-type TEXT      batch | log | all (default: batch)
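
For orchestrators that launch the CLI from Python (an Airflow task, for example), the options above map onto an argument list as sketched below. The helper function is hypothetical. Only the flags themselves come from the option list.

```python
from typing import List, Optional, Sequence


def monitor_run_argv(
    feature_view: Optional[str] = None,
    feature_names: Sequence[str] = (),
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    granularity: Optional[str] = None,
    set_baseline: bool = False,
    source_type: Optional[str] = None,
) -> List[str]:
    """Build the argv for `feast monitor run`; omitted options keep CLI defaults."""
    argv = ["feast", "monitor", "run"]
    if feature_view:
        argv += ["--feature-view", feature_view]
    for name in feature_names:  # --feature-name is repeatable
        argv += ["--feature-name", name]
    if start_date:
        argv += ["--start-date", start_date]
    if end_date:
        argv += ["--end-date", end_date]
    if granularity:
        argv += ["--granularity", granularity]
    if set_baseline:
        argv.append("--set-baseline")
    if source_type:
        argv += ["--source-type", source_type]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Calling the helper with no arguments yields the bare command, which relies on auto-detection for dates.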

Auto-Baseline on `feast apply`

Automatically queues baseline metric computation for new features on feast apply
Non-blocking (async DQM job), idempotent (skips existing baselines)
Configurable — can be disabled via feature_store.yaml:

feature_server:
  dqm:
    distribution:
      initial:
        enabled: false

Feast Operator Support

New CRD types: DqmConfig, DqmDistributionConfig, DqmInitialDistributionConfig added to FeatureStoreSpec
Operator generates feature_server.dqm section in feature_store.yaml when DQM config is set
DeepCopy methods auto-generated via make generate
Disabling auto-baseline from operator CR:

apiVersion: feast.dev/v1
kind: FeatureStore
spec:
  feastProject: my_project
  dqm:
    distribution:
      initial:
        enabled: false

Documentation

How-to guide: docs/how-to-guides/feature-monitoring.md — Production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
Quickstart notebook: examples/monitoring/monitoring-quickstart.ipynb — 12-step hands-on walkthrough with visualization examples
docs/SUMMARY.md updated with links to both

Design decisions:

Native OfflineStore compute + storage — Each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
Hybrid fallback — Backends that don't implement native compute fall back to Python/PyArrow, ensuring all offline stores are supported.
Separate /monitoring/ route rather than extending existing /metrics/ — The existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.
DQM Job Engine for async computation — Supports future Ray/Spark integration for distributed metric computation.
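
The storage decision relies on idempotent upserts: recomputing metrics for the same feature, date, and granularity should overwrite the existing row rather than duplicate it. The sketch below shows the idea using SQLite from the standard library as a stand-in, because its `INSERT ... ON CONFLICT` syntax closely resembles PostgreSQL's. The table name, columns, and key are hypothetical and are not taken from the PR.

```python
import sqlite3
from typing import Tuple

KEY_COLUMNS = "feature_view_name, feature_name, metric_date, granularity, data_source_type"


def open_store() -> sqlite3.Connection:
    """Create an in-memory table keyed by feature, date, granularity, and source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"""
        CREATE TABLE feature_metrics (
            feature_view_name TEXT NOT NULL,
            feature_name TEXT NOT NULL,
            metric_date TEXT NOT NULL,
            granularity TEXT NOT NULL,
            data_source_type TEXT NOT NULL,
            mean REAL,
            null_rate REAL,
            PRIMARY KEY ({KEY_COLUMNS})
        )
        """
    )
    return conn


def save_metric(
    conn: sqlite3.Connection,
    key: Tuple[str, str, str, str, str],
    mean: float,
    null_rate: float,
) -> None:
    """Insert a metrics row, or overwrite the values if the key already exists."""
    conn.execute(
        f"""
        INSERT INTO feature_metrics ({KEY_COLUMNS}, mean, null_rate)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT ({KEY_COLUMNS})
        DO UPDATE SET mean = excluded.mean, null_rate = excluded.null_rate
        """,
        (*key, mean, null_rate),
    )
```

Backends without `ON CONFLICT` would express the same behaviour with `MERGE`, as the backend table indicates.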

Which issue(s) this PR fixes:

Partially Fixes #5919

Checks

I've made sure the tests are passing.
My commits are signed off (git commit -s)
My PR title follows conventional commits format

Testing Strategy

Unit tests
Integration tests
Operator unit tests (Ginkgo)

Test coverage (all passing):

| Test Suite | Count | Covers |
| --- | --- | --- |
| `test_metrics_calculator.py` | 19 | Numeric/categorical computation, edge cases (empty, all-null, single value, high cardinality), type classification, PyArrow type classification |
| `test_monitoring_integration.py` | 16+ | End-to-end batch/log computation, baseline flow, view/service aggregation, native storage dispatch, log feature name normalization, REST API endpoints, CLI, RBAC enforcement |
| `repo_config_test.go` | 92 | Operator repo config generation including DQM config with initial distribution disabled, YAML serialization verification |

Snyk SAST scan: 0 vulnerabilities across all new files.

devin-ai-integration

Devin Review found 2 new potential issues.


ntkathole · 2026-05-06T13:58:03Z

 	BatchEngine     *BatchEngineConfig    `json:"batchEngine,omitempty"`
+	// Dqm configures Data Quality Monitoring behaviour.
+	// +optional
+	Dqm *DqmConfig `json:"dqm,omitempty"`


dqm is not user-friendly; I think it would be better to name it feature_monitoring or something similar.

@ntkathole DQM is a well-known term among developers and data scientists. Also, feature_monitoring is a broader term that covers both operational metrics and data quality metrics, so I kept this limited to data quality.

I can rename it to DataQualityMonitoring if you think that's better?

Yes, the explicit DataQualityMonitoring is better than the short name.

ntkathole · 2026-05-06T14:06:20Z

    (feast_feature_freshness_seconds)."""


+class DqmInitialDistributionConfig(FeastConfigBaseModel):


I think these configs should also live at the top level instead of under feature_server. They use the offline store, not the online server. This field is more similar to materialization, which is a top-level config.

# feature_store.yaml
feature_monitoring:
  auto_baseline: false

This matches the existing pattern: materialization spans the offline and online stores, openlineage spans apply and materialize, and feature_monitoring would span the offline store (compute/storage), the apply trigger, and the server API.

This could be because the existing metrics config resides under the server config. Metrics (even operational metrics) could also be computed for the offline store, so I think it would be good to move both operational metrics and DQM metrics under a parent monitoring config.

Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com> Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>

jyejare requested a review from a team as a code owner March 31, 2026 10:53

jyejare marked this pull request as draft March 31, 2026 10:54

jyejare force-pushed the monitoring_plus branch from 4340dbb to 940a4af Compare March 31, 2026 10:54


jyejare force-pushed the monitoring_plus branch 4 times, most recently from d0b45bb to c06853e Compare April 21, 2026 14:00

jyejare marked this pull request as ready for review April 21, 2026 14:34

jyejare requested review from a team and sudohainguyen as code owners April 21, 2026 14:34

jyejare requested review from lokeshrangineni, robhowley and tokoko and removed request for a team April 21, 2026 14:34

devin-ai-integration Bot reviewed Apr 21, 2026


Comment thread sdk/python/feast/infra/offline_stores/bigquery.py Outdated

Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated

jyejare changed the title ~~feat: Add feature quality monitoring with statistical metrics, REST API, and CLI~~ feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config Apr 21, 2026

jyejare changed the title ~~feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config~~ feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI Apr 21, 2026

jyejare mentioned this pull request May 5, 2026

[Feature] Built-in feature drift detection with alerting #6341

Open

jyejare force-pushed the monitoring_plus branch from 3da4dde to 0344087 Compare May 5, 2026 08:52

ntkathole force-pushed the monitoring_plus branch from 0344087 to 3c73a70 Compare May 6, 2026 11:59