feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI #6202

Open
jyejare wants to merge 7 commits into feast-dev:master from jyejare:monitoring_plus

@jyejare (Collaborator) commented Mar 31, 2026

What this PR does / why we need it:

This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.

What it adds:

Core Monitoring Engine

  • Hybrid computation engine — SQL push-down on the native OfflineStore as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (same architecture as Feast materialization).
  • Fully native storage — Monitoring metrics are stored within the configured OfflineStore backend itself (no separate monitoring database). Six static methods on the OfflineStore base class (compute_monitoring_metrics, get_monitoring_max_timestamp, ensure_monitoring_tables, save_monitoring_metrics, query_monitoring_metrics, clear_monitoring_baseline) handle compute and storage.
  • PyArrow-based metrics computation (MetricsCalculator) — Backend-agnostic statistical computation as fallback, supporting:
    • Numeric features: mean, stddev, min/max, percentiles (p50/p75/p90/p95/p99), null rate, histograms
    • Categorical features: top-N value counts with other/unique counts
    • Automatic feature type classification from Feast's PrimitiveFeastType and ValueType
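
As a rough illustration of the fallback statistics listed above, here is a minimal standard-library sketch. It is not the PR's `MetricsCalculator` (which is PyArrow-based); the function names, the returned keys, the use of population standard deviation, and the linear-interpolation percentile are all assumptions made for this example.

```python
import math
import statistics
from collections import Counter
from typing import Any, Dict, Hashable, Optional, Sequence


def numeric_metrics(values: Sequence[Optional[float]]) -> Dict[str, Any]:
    """Summarize a numeric column; None entries are counted as nulls."""
    present = sorted(v for v in values if v is not None)
    total = len(values)
    null_rate = (total - len(present)) / total if total else 0.0
    if not present:
        # Empty or all-null column: only count and null rate are meaningful.
        return {"count": total, "null_rate": null_rate}

    def percentile(q: float) -> float:
        # Linear interpolation between the two closest ranks.
        pos = q * (len(present) - 1)
        lo, hi = math.floor(pos), math.ceil(pos)
        return present[lo] + (present[hi] - present[lo]) * (pos - lo)

    return {
        "count": total,
        "null_rate": null_rate,
        "mean": statistics.fmean(present),
        "stddev": statistics.pstdev(present),  # population stddev (assumption)
        "min": present[0],
        "max": present[-1],
        **{f"p{int(q * 100)}": percentile(q) for q in (0.50, 0.75, 0.90, 0.95, 0.99)},
    }


def categorical_metrics(
    values: Sequence[Optional[Hashable]], top_n: int = 10
) -> Dict[str, Any]:
    """Top-N value counts plus 'other' and 'unique' counts for a categorical column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top = counts.most_common(top_n)
    return {
        "top_values": top,
        "other_count": len(present) - sum(c for _, c in top),
        "unique_count": len(counts),
        "null_rate": (len(values) - len(present)) / len(values) if values else 0.0,
    }
```

Histograms are left out to keep the sketch short; the same null-filtering step would precede any binning.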

Multi-Backend Support (8 Offline Stores)

All 6 native monitoring methods implemented for each backend with dialect-specific SQL:

| Backend | Compute | Storage | Dialect highlights |
|---|---|---|---|
| PostgreSQL | SQL push-down | INSERT ON CONFLICT | PERCENTILE_CONT, WIDTH_BUCKET |
| Snowflake | SQL push-down | MERGE with VARIANT JSON | APPROX_PERCENTILE, WIDTH_BUCKET |
| BigQuery | SQL push-down | MERGE into BQ tables | APPROX_QUANTILES, parameterized queries |
| Redshift | SQL push-down | MERGE via Data API | APPROXIMATE PERCENTILE_DISC |
| Spark | SparkSQL push-down | Parquet tables | PERCENTILE_APPROX, spark.sql() |
| Oracle | SQL via Ibis | MERGE FROM DUAL | PERCENTILE_CONT WITHIN GROUP |
| DuckDB | In-memory SQL | Parquet files | QUANTILE_CONT, HISTOGRAM |
| Dask | PyArrow compute | Parquet files | pyarrow.compute + numpy |
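
To make the "dialect-specific SQL" idea concrete, the sketch below maps a few of the percentile functions named in the table to SQL expression strings. It is illustrative only and is not the SQL generated by this PR; the dialect keys, the `percentile_expr` helper, and the identifier check are hypothetical.

```python
from typing import Callable, Dict

# Percentile expression builders keyed by a hypothetical dialect name.
PERCENTILE_SQL: Dict[str, Callable[[str, float], str]] = {
    "postgres": lambda col, q: f"PERCENTILE_CONT({q}) WITHIN GROUP (ORDER BY {col})",
    "snowflake": lambda col, q: f"APPROX_PERCENTILE({col}, {q})",
    "bigquery": lambda col, q: f"APPROX_QUANTILES({col}, 100)[OFFSET({round(q * 100)})]",
    "redshift": lambda col, q: (
        f"APPROXIMATE PERCENTILE_DISC({q}) WITHIN GROUP (ORDER BY {col})"
    ),
    "spark": lambda col, q: f"PERCENTILE_APPROX({col}, {q})",
    "duckdb": lambda col, q: f"QUANTILE_CONT({col}, {q})",
}


def percentile_expr(dialect: str, column: str, q: float) -> str:
    """Return a percentile SQL expression for the given dialect."""
    if not 0.0 <= q <= 1.0:
        raise ValueError(f"quantile must be in [0, 1], got {q}")
    if not column.isidentifier():
        # Crude guard against injecting arbitrary SQL via the column name.
        raise ValueError(f"unsafe column name: {column!r}")
    try:
        builder = PERCENTILE_SQL[dialect]
    except KeyError:
        raise ValueError(f"unsupported dialect: {dialect!r}") from None
    return builder(column, q)
```

For example, `percentile_expr("postgres", "conv_rate", 0.5)` yields an exact percentile expression, while the Snowflake, BigQuery, Redshift, and Spark variants are approximate by design.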

Multi-Granularity Time-Series Metrics

  • 5 granularities: daily, weekly, biweekly, monthly, quarterly
  • Auto-compute mode: Detects latest event timestamp and computes all granularities in one job
  • Pre-computed metrics stored per date + granularity for fast retrieval
  • On-demand transient compute: Fresh statistics for arbitrary date ranges (not stored)

Batch + Log Data Source Support

  • Batch source: Reads from the feature view's batch_source via OfflineStore.pull_all_from_table_or_query()
  • Log source: Reads from feature serving logs via FeatureService.logging_config destination, using __log_timestamp as event timestamp
  • Feature name normalization: Prefixed log column names (driver_stats__conv_rate) are parsed back to their original feature_view_name + feature_name for storage compatibility and drift detection
  • data_source_type column (batch / log) differentiates metrics in storage
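
The feature name normalization described above can be sketched as follows. This is a hypothetical helper, not the PR's code; it assumes the prefix separator is a double underscore (as in `driver_stats__conv_rate`) and that the feature view name itself contains no double underscore.

```python
from typing import NamedTuple


class FeatureRef(NamedTuple):
    feature_view_name: str
    feature_name: str


def parse_log_column(column: str, separator: str = "__") -> FeatureRef:
    """Split a prefixed log column into (feature_view_name, feature_name)."""
    # partition() splits at the first separator, so any later "__" stays
    # in the feature name (assumption about how prefixes are built).
    view, sep, feature = column.partition(separator)
    if not sep or not view or not feature:
        raise ValueError(f"not a prefixed log column: {column!r}")
    return FeatureRef(view, feature)
```

With this split, metrics computed from logs can be stored under the same `feature_view_name` and `feature_name` keys as batch metrics, which is what makes batch-versus-log comparison possible.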

Orchestration Service (MonitoringService)

  • Ties registry, offline store, calculator, and storage together
  • Computes and aggregates metrics at feature, feature view, and feature service levels
  • Cached OfflineStore instance for performance
  • Unified compute/timestamp methods handling both batch and log paths with SQL push-down + fallback

Shared Utilities (monitoring_utils.py)

  • Centralized table name constants, column lists, PK definitions
  • monitoring_table_meta(), opt_float(), empty_numeric_metric(), empty_categorical_metric(), normalize_monitoring_row(), build_view_aggregate()
  • Used by all 8 backends — eliminates duplication and ensures consistency

DQM Job Engine (DQMJobManager)

  • Asynchronous job abstraction for metric computation (compute, baseline, auto_compute)
  • Job status tracking in feast_monitoring_jobs table
  • Supports future integration with Ray/Spark job runners

REST API (/monitoring/)

| Method | Endpoint | Description |
|---|---|---|
| POST | /monitoring/compute | Submit batch DQM job |
| POST | /monitoring/auto_compute | Auto-detect dates, all granularities |
| POST | /monitoring/compute/transient | On-demand compute (not stored) |
| POST | /monitoring/compute/log | Compute from serving logs |
| POST | /monitoring/auto_compute/log | Auto-detect log dates, all granularities |
| GET | /monitoring/jobs/{job_id} | DQM job status |
| GET | /monitoring/metrics/features | Per-feature metrics |
| GET | /monitoring/metrics/feature_views | Per-view aggregates |
| GET | /monitoring/metrics/feature_services | Per-service aggregates |
| GET | /monitoring/metrics/baseline | Baseline distribution retrieval |
| GET | /monitoring/metrics/timeseries | Time-series data for trend analysis |

All endpoints support cascading filters: project, feature_service_name, feature_view_name, feature_name, granularity, data_source_type, date range.
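
A client might assemble a filtered metrics request roughly like this. The sketch assumes the filters are passed as query parameters with the names listed above; the base URL, port, and helper name are placeholders, and the actual request schema should be checked against the how-to guide.

```python
from typing import Optional
from urllib.parse import urlencode


def build_metrics_url(
    base_url: str, level: str = "features", **filters: Optional[str]
) -> str:
    """Build a GET URL for /monitoring/metrics/<level> with optional filters."""
    # Drop unset filters so only the requested cascade levels are sent.
    params = {key: value for key, value in filters.items() if value is not None}
    url = f"{base_url.rstrip('/')}/monitoring/metrics/{level}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Example: daily metrics for one feature view (host and port are placeholders).
url = build_metrics_url(
    "http://localhost:6572",
    project="my_project",
    feature_view_name="driver_stats",
    granularity="daily",
)
```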

RBAC enforced using existing AuthzedAction.DESCRIBE (read) and AuthzedAction.UPDATE (compute).

CLI (feast monitor run)

Options:
  --feature-view TEXT     Feature view name (omit for all)
  --feature-name TEXT     Feature name(s), repeatable
  --start-date TEXT       Start date YYYY-MM-DD (omit for auto-detect)
  --end-date TEXT         End date YYYY-MM-DD (omit for auto-detect)
  --granularity TEXT      daily | weekly | biweekly | monthly | quarterly
  --set-baseline          Mark this computation as baseline
  --source-type TEXT      batch | log | all (default: batch)

Auto-Baseline on feast apply

  • Automatically queues baseline metric computation for new features on feast apply
  • Non-blocking (async DQM job), idempotent (skips existing baselines)
  • Configurable — can be disabled via feature_store.yaml:
feature_server:
  dqm:
    distribution:
      initial:
        enabled: false
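
For illustration, a consumer of this setting could read the nested flag as shown below once `feature_store.yaml` has been parsed into a dictionary. The helper name is hypothetical, and treating a missing key as "enabled" is an assumption inferred from the fact that the feature has to be switched off explicitly.

```python
from typing import Any, Mapping

# Path to the flag, following the YAML fragment above.
_FLAG_PATH = ("feature_server", "dqm", "distribution", "initial")


def auto_baseline_enabled(config: Mapping[str, Any]) -> bool:
    """Return False only when the flag is explicitly set to a falsy value."""
    node: Any = config
    for key in _FLAG_PATH:
        if not isinstance(node, Mapping) or key not in node:
            return True  # assumed default: auto-baseline stays on
        node = node[key]
    if not isinstance(node, Mapping):
        return True
    return bool(node.get("enabled", True))
```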

Feast Operator Support

  • New CRD types: DqmConfig, DqmDistributionConfig, DqmInitialDistributionConfig added to FeatureStoreSpec
  • Operator generates feature_server.dqm section in feature_store.yaml when DQM config is set
  • DeepCopy methods auto-generated via make generate
  • Disabling auto-baseline from operator CR:
apiVersion: feast.dev/v1
kind: FeatureStore
spec:
  feastProject: my_project
  dqm:
    distribution:
      initial:
        enabled: false

Documentation

  • How-to guide: docs/how-to-guides/feature-monitoring.md — Production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
  • Quickstart notebook: examples/monitoring/monitoring-quickstart.ipynb — 12-step hands-on walkthrough with visualization examples
  • docs/SUMMARY.md updated with links to both

Design decisions:

  • Native OfflineStore compute + storage — Each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
  • Hybrid fallback — Backends that don't implement native compute fall back to Python/PyArrow, ensuring all offline stores are supported.
  • Separate /monitoring/ route rather than extending existing /metrics/ — The existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.
  • DQM Job Engine for async computation — Supports future Ray/Spark integration for distributed metric computation.
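
The hybrid fallback decision can be pictured with the small dispatch sketch below. It assumes a backend signals "no native compute" by raising `NotImplementedError`, which is a guess at the mechanism rather than a description of the PR; both callables and the returned path label are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

Metrics = Dict[str, Any]
ComputeFn = Callable[[Sequence[float]], Metrics]


def compute_with_fallback(
    native_compute: ComputeFn,
    python_compute: ComputeFn,
    values: Sequence[float],
) -> Tuple[Metrics, str]:
    """Try the backend's native path first, then fall back to Python."""
    try:
        return native_compute(values), "native"
    except NotImplementedError:
        # Backend has no SQL push-down; compute in-process instead.
        return python_compute(values), "python"


def unsupported_backend(values: Sequence[float]) -> Metrics:
    raise NotImplementedError("no native monitoring compute")


def python_mean(values: Sequence[float]) -> Metrics:
    return {"mean": sum(values) / len(values)}


result, path = compute_with_fallback(unsupported_backend, python_mean, [1.0, 3.0])
```

Only `NotImplementedError` is caught here, so genuine failures in a native path (bad SQL, connection errors) would still surface instead of being masked by the fallback.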

Which issue(s) this PR fixes:

Partially Fixes #5919

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests
  • Integration tests
  • Operator unit tests (Ginkgo)

Test coverage (all passing):

| Test Suite | Count | Covers |
|---|---|---|
| test_metrics_calculator.py | 19 | Numeric/categorical computation, edge cases (empty, all-null, single value, high cardinality), type classification, PyArrow type classification |
| test_monitoring_integration.py | 16+ | End-to-end batch/log computation, baseline flow, view/service aggregation, native storage dispatch, log feature name normalization, REST API endpoints, CLI, RBAC enforcement |
| repo_config_test.go | 92 | Operator repo config generation including DQM config with initial distribution disabled, YAML serialization verification |

Snyk SAST scan: 0 vulnerabilities across all new files.

@jyejare jyejare requested a review from a team as a code owner March 31, 2026 10:53
@jyejare jyejare marked this pull request as draft March 31, 2026 10:54
devin-ai-integration[bot]

This comment was marked as resolved.

@jyejare jyejare force-pushed the monitoring_plus branch 4 times, most recently from d0b45bb to c06853e Compare April 21, 2026 14:00
@jyejare jyejare marked this pull request as ready for review April 21, 2026 14:34
@jyejare jyejare requested review from a team and sudohainguyen as code owners April 21, 2026 14:34
@jyejare jyejare requested review from lokeshrangineni, robhowley and tokoko and removed request for a team April 21, 2026 14:34
@devin-ai-integration (Contributor) left a comment

Devin Review found 2 new potential issues.

View 8 additional findings in Devin Review.

Comment thread sdk/python/feast/infra/offline_stores/bigquery.py Outdated
Comment thread sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py Outdated
@jyejare jyejare changed the title feat: Add feature quality monitoring with statistical metrics, REST API, and CLI feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config Apr 21, 2026
@jyejare jyejare changed the title feat: Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config Apr 21, 2026
@jyejare jyejare changed the title feat: [Backend] Add feature quality monitoring with native offline store compute/storage, multi-backend support, REST API, CLI, and operator config feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI Apr 21, 2026
@jyejare jyejare force-pushed the monitoring_plus branch from 3da4dde to 0344087 Compare May 5, 2026 08:52
Comment thread sdk/python/feast/api/registry/rest/monitoring.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_utils.py Outdated
Comment thread sdk/python/feast/monitoring/dqm_job_manager.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_service.py Outdated
Comment thread sdk/python/feast/monitoring/monitoring_store.py Outdated
BatchEngine *BatchEngineConfig `json:"batchEngine,omitempty"`
// Dqm configures Data Quality Monitoring behaviour.
// +optional
Dqm *DqmConfig `json:"dqm,omitempty"`
Member

dqm is not user-friendly; I think it would be better to name it feature_monitoring or something similar.

Collaborator Author

@ntkathole DQM is a well-known term among developers and data scientists. Also, feature_monitoring is a broader term that covers both operational metrics and data quality metrics, so I kept this limited to data quality.

I can rename it to DataQualityMonitoring if you think that's better?

Member

Yeah, the explicit DataQualityMonitoring is better than the short name.

(feast_feature_freshness_seconds)."""


class DqmInitialDistributionConfig(FeastConfigBaseModel):
@ntkathole (Member) commented May 6, 2026

I think these configs should also live at the top level instead of under feature_server. Monitoring uses the offline store, not the online server, so this field is closer to materialization, which is a top-level config.

# feature_store.yaml

feature_monitoring:
  auto_baseline: false

This matches the existing pattern: materialization spans the offline and online stores, and openlineage spans apply and materialize. Likewise, feature_monitoring spans the offline store (compute/storage), the apply trigger, and the server API.

Collaborator Author

This is probably because the existing metrics config lives under the server config. Metrics (even operational ones) could also be computed for the offline store, so I think it would be good to move both the operational metrics and the DQM metrics under a parent monitoring config.

Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
jyejare added 5 commits May 8, 2026 12:50
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare force-pushed the monitoring_plus branch 6 times, most recently from d6c1162 to fd827d9 Compare May 8, 2026 15:20
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
@jyejare jyejare force-pushed the monitoring_plus branch from fd827d9 to c190315 Compare May 8, 2026 15:50
Development

Successfully merging this pull request may close these issues.

Revamp Data Quality Monitoring

2 participants