feat: [Backend] DQM with native compute, multi-backend support, REST API, CLI #6202
jyejare wants to merge 7 commits into feast-dev:master
Conversation
```go
BatchEngine *BatchEngineConfig `json:"batchEngine,omitempty"`

// Dqm configures Data Quality Monitoring behaviour.
// +optional
Dqm *DqmConfig `json:"dqm,omitempty"`
```
`dqm` is not user-friendly; I think it's better to name it `feature_monitoring` or something else.
@ntkathole `dqm` is a well-known term among developers and data scientists. Also, `feature_monitoring` is a global-scope term that covers both operational metrics and data quality metrics, so I kept it limited to data quality.
I can rephrase it to `DataQualityMonitoring` if you think that's better?
Yeah, the explicit `DataQualityMonitoring` is better than the short name.
```python
(feast_feature_freshness_seconds)."""


class DqmInitialDistributionConfig(FeastConfigBaseModel):
```
I think these configs should also live at the top level instead of under `feature_server`. It uses the offline store, not the online server. This field is more similar to `materialization`, which is top-level config.
```yaml
# feature_store.yaml
feature_monitoring:
  auto_baseline: false
```
This matches the pattern: `materialization:` spans offline + online stores, `openlineage:` spans apply + materialize, and `feature_monitoring:` would span the offline store (compute/storage) + the apply trigger + the server API.
This could be because the existing metrics config resides under the server config. Metrics (even operational metrics) could also be computed for the offline store, so I think it would be good to move both operational metrics and DQM metrics under a parent monitoring config.
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Signed-off-by: Jitendra Yejare <11752425+jyejare@users.noreply.github.com>
What this PR does / why we need it:
This PR introduces comprehensive feature quality monitoring capabilities to Feast, enabling proactive tracking of feature distributions and data quality metrics. Currently, Feast has no built-in tools for monitoring feature health in production — ML teams must build custom solutions to detect issues like distribution shifts, elevated null rates, or degraded data quality before they silently impact model performance.
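To make "distribution shift" concrete, here is a small illustration of the kind of statistic such monitoring can surface: a population stability index (PSI) comparing a current feature sample against a baseline. PSI is used here purely as an illustration; the actual metric set this PR ships is described in the list that follows, and the function below is not part of the PR's API.

```python
import numpy as np

def population_stability_index(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Illustrative drift statistic: PSI between a baseline and a current sample."""
    edges = np.histogram_bin_edges(baseline, bins=bins)      # bin layout fixed by the baseline
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)                 # floor proportions to avoid log(0)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
same = rng.normal(0.0, 1.0, 10_000)       # same distribution -> PSI near zero
drifted = rng.normal(0.5, 1.0, 10_000)    # shifted mean -> clearly elevated PSI
psi_same = population_stability_index(baseline, same)
psi_drift = population_stability_index(baseline, drifted)
```

A monitoring system computes statistics like this continuously against a stored baseline, so teams see the drift before the model does.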
What it adds:
Core Monitoring Engine
- Uses the `OfflineStore` as the primary compute path, with a Python-based (PyArrow/NumPy) fallback for backends that don't implement native compute. This leverages the offline store as a compute engine (the same architecture as Feast materialization).
- Metrics are stored in the `OfflineStore` backend itself (no separate monitoring database). Six static methods on the `OfflineStore` base class (`compute_monitoring_metrics`, `get_monitoring_max_timestamp`, `ensure_monitoring_tables`, `save_monitoring_metrics`, `query_monitoring_metrics`, `clear_monitoring_baseline`) handle compute and storage.
- Python fallback (`MetricsCalculator`) — backend-agnostic statistical computation as a fallback, supporting `PrimitiveFeastType` and `ValueType`
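The fallback path computes per-feature statistics in plain Python/NumPy rather than pushing SQL down to the backend. A minimal sketch of what numeric-feature metrics could look like follows; the function name, exact metric set, and dict layout are illustrative assumptions, not the PR's actual `MetricsCalculator` API.

```python
import numpy as np

def numeric_feature_metrics(values, bins: int = 10) -> dict:
    """Sketch of backend-agnostic metric computation for one numeric feature column."""
    arr = np.asarray(values, dtype=float)
    nulls = np.isnan(arr)
    present = arr[~nulls]
    counts, edges = np.histogram(present, bins=bins)
    return {
        "count": int(arr.size),
        "null_rate": float(nulls.mean()) if arr.size else 0.0,
        "mean": float(present.mean()) if present.size else None,
        "std": float(present.std()) if present.size else None,
        "p50": float(np.percentile(present, 50)) if present.size else None,
        "p95": float(np.percentile(present, 95)) if present.size else None,
        "histogram": {"bin_edges": edges.tolist(), "counts": counts.tolist()},
    }

metrics = numeric_feature_metrics([1.0, 2.0, 2.0, 3.0, float("nan")])
```

Keeping the fallback in NumPy means every backend gets identical metric semantics even when it lacks native percentile or histogram SQL.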
Multi-Backend Support (8 Offline Stores)
All 6 native monitoring methods implemented for each backend with dialect-specific SQL:
- `INSERT ON CONFLICT`; `PERCENTILE_CONT`, `WIDTH_BUCKET`
- `MERGE` with `VARIANT` JSON; `APPROX_PERCENTILE`, `WIDTH_BUCKET`
- `MERGE` into BQ tables; `APPROX_QUANTILES`, parameterized queries
- `MERGE` via Data API; `APPROXIMATE PERCENTILE_DISC`
- `PERCENTILE_APPROX` via `spark.sql()`
- `MERGE FROM DUAL`; `PERCENTILE_CONT WITHIN GROUP`
- `QUANTILE_CONT`, `HISTOGRAM`
- `pyarrow.compute` + `numpy`
Multi-Granularity Time-Series Metrics
- Supported granularities: `daily`, `weekly`, `biweekly`, `monthly`, `quarterly`
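Bucketing event timestamps into these granularities can be sketched as below. The exact semantics (week start day, the anchor date for `biweekly` periods) are assumptions for illustration; the PR's actual bucketing rules may differ.

```python
from datetime import date, timedelta

def bucket_start(day: date, granularity: str) -> date:
    """Map an event date to the start of its aggregation bucket (illustrative semantics)."""
    if granularity == "daily":
        return day
    if granularity == "weekly":
        return day - timedelta(days=day.weekday())       # ISO week, starting Monday
    if granularity == "biweekly":
        epoch = date(2024, 1, 1)                         # fixed anchor date (assumption)
        periods = (day - epoch).days // 14
        return epoch + timedelta(days=periods * 14)
    if granularity == "monthly":
        return day.replace(day=1)
    if granularity == "quarterly":
        quarter_month = 3 * ((day.month - 1) // 3) + 1
        return day.replace(month=quarter_month, day=1)
    raise ValueError(f"unknown granularity: {granularity}")
```

Grouping metrics by bucket start lets the timeseries endpoints aggregate one stored row per (feature, bucket) pair.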
Batch + Log Data Source Support
- Batch data is read from the `batch_source` via `OfflineStore.pull_all_from_table_or_query()`
- Log data is read from the `FeatureService.logging_config` destination, using `__log_timestamp` as the event timestamp
- Logged feature columns (e.g. `driver_stats__conv_rate`) are parsed back to their original `feature_view_name` + `feature_name` for storage compatibility and drift detection
- A `data_source_type` column (batch/log) differentiates metrics in storage
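The logged-column parsing mentioned above can be sketched as follows. The `__` separator comes from Feast's fully-qualified column convention; the function name and the choice to split at the first `__` (assuming view names contain no double underscore) are assumptions for illustration.

```python
def parse_logged_feature(column: str) -> tuple[str, str]:
    """Split a logged column such as 'driver_stats__conv_rate' into
    (feature_view_name, feature_name).

    Splits at the first '__', assuming feature view names never
    contain a double underscore (illustrative assumption).
    """
    view, sep, feature = column.partition("__")
    if not sep or not view or not feature:
        raise ValueError(f"not a fully qualified feature column: {column!r}")
    return view, feature

view_name, feature_name = parse_logged_feature("driver_stats__conv_rate")
```

Recovering the original view/feature pair is what allows log-derived metrics to land in the same storage rows, and be compared against the same baselines, as batch-derived metrics.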
Orchestration Service (`MonitoringService`)
- Reuses a cached `OfflineStore` instance for performance

Shared Utilities (`monitoring_utils.py`)
- `monitoring_table_meta()`, `opt_float()`, `empty_numeric_metric()`, `empty_categorical_metric()`, `normalize_monitoring_row()`, `build_view_aggregate()`

DQM Job Engine (`DQMJobManager`)
- Job types: `compute`, `baseline`, `auto_compute`
- Job state tracked in a `feast_monitoring_jobs` table
REST API (`/monitoring/`)
- `POST /monitoring/compute`
- `POST /monitoring/auto_compute`
- `POST /monitoring/compute/transient`
- `POST /monitoring/compute/log`
- `POST /monitoring/auto_compute/log`
- `GET /monitoring/jobs/{job_id}`
- `GET /monitoring/metrics/features`
- `GET /monitoring/metrics/feature_views`
- `GET /monitoring/metrics/feature_services`
- `GET /monitoring/metrics/baseline`
- `GET /monitoring/metrics/timeseries`

All endpoints support cascading filters: `project`, `feature_service_name`, `feature_view_name`, `feature_name`, `granularity`, `data_source_type`, date range.
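Querying one of the GET endpoints with cascading filters might look like the sketch below. The base URL, and the assumption that filters are passed as query-string parameters with exactly these names, are illustrative; only the route and filter names come from the PR description.

```python
from urllib.parse import urlencode

def timeseries_url(base: str, **filters) -> str:
    """Build a GET URL for the timeseries metrics endpoint (illustrative helper).

    Unset filters are simply omitted, mirroring the cascading-filter idea:
    broader queries pass fewer parameters.
    """
    params = {k: v for k, v in filters.items() if v is not None}
    return f"{base}/monitoring/metrics/timeseries?{urlencode(params)}"

url = timeseries_url(
    "http://localhost:6566",               # hypothetical feature server address
    project="demo",
    feature_view_name="driver_stats",
    feature_name="conv_rate",
    granularity="daily",
    data_source_type=None,                 # omitted -> both batch and log metrics
)
```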
RBAC is enforced using the existing `AuthzedAction.DESCRIBE` (read) and `AuthzedAction.UPDATE` (compute).

CLI (`feast monitor run`)

Auto-Baseline on `feast apply`
- Baseline distributions can be computed automatically during `feast apply`, enabled via `feature_store.yaml`

Feast Operator Support
- `DqmConfig`, `DqmDistributionConfig`, `DqmInitialDistributionConfig` added to `FeatureStoreSpec`
- The operator renders a `feature_server.dqm` section in `feature_store.yaml` when DQM config is set
- Manifests regenerated via `make generate`

Documentation
- `docs/how-to-guides/feature-monitoring.md` — production setup, CLI usage, REST API reference, orchestrator integration (Airflow, KFP, cron, K8s CronJob), backend compatibility table
- `examples/monitoring/monitoring-quickstart.ipynb` — 12-step hands-on walkthrough with visualization examples
- `docs/SUMMARY.md` updated with links to both

Design decisions:
- `OfflineStore` compute + storage — each backend implements its own SQL push-down for metrics calculation and uses its native UPSERT/MERGE for storage. No separate monitoring database needed.
- New `/monitoring/` route rather than extending the existing `/metrics/` — the existing metrics route serves registry inventory metadata; monitoring serves statistical feature quality data with a different data path.

Which issue(s) this PR fixes:
Partially Fixes #5919
Checks
- Commits are signed off (`git commit -s`)

Testing Strategy
Test coverage (all passing):
- `test_metrics_calculator.py`
- `test_monitoring_integration.py`
- `repo_config_test.go`

Snyk SAST scan: 0 vulnerabilities across all new files.