[FEATURE] created_by Provenance Tag Support in ML Commons

**Is your feature request related to a problem?**

The ML Commons stats framework (`MLStatsJobProcessor`) publishes adoption metrics for models, agents, and connectors as OTel counters with rich tags describing what was created: service provider, model type, deployment mode, etc. However, there is no way to attribute which plugin or caller provisioned a given resource. This makes it impossible to distinguish, in the metrics, between resources created by an automated plugin provisioning flow (e.g., the Flow Framework plugin), resources created directly by users via the API, and resources created by other plugins.

The `MachineLearningClient` interface (used by all plugins integrating with ML Commons) provides no mechanism to pass caller provenance. The underlying input objects, `MLCreateConnectorInput`, `MLRegisterModelInput`, and `MLAgent`, have no `created_by` field. The transport actions that persist these objects (`TransportCreateConnectorAction`, `TransportRegisterModelAction`, `TransportRegisterAgentAction`) never record provenance. And `MLModel.getTags()` / `MLAgent.getTags()` have no such dimension to emit.

As a concrete example: a plugin (like Flow Framework) that automates ML resource provisioning (connectors, models, agents) as a "one-and-done" setup step wants to measure how many users continue to actively use the resources it provisioned, as distinct from resources provisioned by other means. This is currently impossible with the existing stats framework.

**What solution would you like?**

Add an optional `created_by` field as first-class metadata across the ML resource creation path, surfaced as a tag in the adoption metrics framework. The changes required span four areas:

1. Domain objects and input classes (common module)

Add `String createdBy` to `MLCreateConnectorInput`, `MLRegisterModelInput`, `MLAgent`, and `MLModel`. (Given that Connectors are currently not used in stats and have a tight relationship to models, we can leave them out.) Implement `toXContent`, `parse`, `writeTo`, and `StreamInput` constructors in each class, version-gated on a new `VERSION_X_Y_Z` constant following the existing pattern.
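
The version gating described above can be sketched as follows. This is only an illustration of the idea, not the ML Commons implementation: plain `java.io` streams and an `int` version stand in for the OpenSearch stream and version classes, and `CreatedByWireSketch` and `VERSION_WITH_CREATED_BY` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: java.io streams and an int version are stand-ins for the
// real transport stream and version types.
final class CreatedByWireSketch {
    // Hypothetical placeholder for the release that introduces created_by.
    static final int VERSION_WITH_CREATED_BY = 2;

    final String name;
    final String createdBy; // optional, may be null

    CreatedByWireSketch(String name, String createdBy) {
        this.name = name;
        this.createdBy = createdBy;
    }

    // Serialize for a peer on peerVersion; older peers never receive the field.
    byte[] writeTo(int peerVersion) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(name);
            if (peerVersion >= VERSION_WITH_CREATED_BY) {
                out.writeBoolean(createdBy != null); // presence flag for the optional string
                if (createdBy != null) {
                    out.writeUTF(createdBy);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize bytes that were written by a peer on peerVersion.
    static CreatedByWireSketch readFrom(byte[] data, int peerVersion) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            String createdBy = null;
            if (peerVersion >= VERSION_WITH_CREATED_BY && in.readBoolean()) {
                createdBy = in.readUTF();
            }
            return new CreatedByWireSketch(name, createdBy);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a round trip at the newer version preserves `createdBy`, while a round trip at the older version yields `null` for it and leaves the other fields intact.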

2. Transport actions (plugin module)
   - `TransportRegisterModelAction`: copy `createdBy` from `MLRegisterModelInput` onto `MLModel` before indexing
   - `TransportRegisterAgentAction`: `MLAgent` is indexed directly, so no additional propagation is needed beyond Step 1

3. Tag emission (common module)

- `MLModel.getTags()` / `getTags(Connector)`: add `created_by` tag to all three tag-building paths (remote, pre-trained, custom)
- `MLAgent.getTags()`: add `created_by` tag
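
A minimal sketch of the tag emission for the remote-model path, under stated assumptions: a plain `Map` stands in for whatever tag type `getTags()` actually returns, `CreatedByTagSketch` and `remoteModelTags` are hypothetical names, and omitting the tag when the field is unset is one possible choice rather than a settled decision.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a Map stands in for the real tag container.
final class CreatedByTagSketch {
    static final String TAG_CREATED_BY = "created_by";

    // Builds tags for a remote model. createdBy may be null for resources
    // stored before the field existed or for callers that never set it.
    static Map<String, String> remoteModelTags(String serviceProvider, String type, String createdBy) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("deployment", "remote");
        tags.put("service_provider", serviceProvider);
        tags.put("type", type);
        // Assumption: the tag is simply omitted when no provenance was recorded.
        if (createdBy != null && !createdBy.isEmpty()) {
            tags.put(TAG_CREATED_BY, createdBy);
        }
        return tags;
    }
}
```

The same conditional would apply to the pre-trained and custom paths and to `MLAgent.getTags()`.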
 
4. Connector metrics in `MLStatsJobProcessor` (plugin module)

`AdoptionMetric.CONNECTOR_COUNT` is currently defined but never incremented. As part of this work, add connector collection to `MLStatsJobProcessor` parallel to the existing model collection, reading `created_by` from the stored connector document and emitting it as a tag. This completes coverage for all three resource types.
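
The connector collection could look roughly like the sketch below. It assumes stored connector documents are available as field maps with a `created_by` key; `ConnectorCountSketch`, `countByCreatedBy`, and the `"(unset)"` bucket are hypothetical and only illustrate the grouping step, not how `MLStatsJobProcessor` queries the index or increments counters.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: each connector document is modeled as a plain field map.
final class ConnectorCountSketch {
    // Hypothetical bucket for documents that carry no created_by value.
    static final String UNSET = "(unset)";

    // Counts connector documents per created_by value.
    static Map<String, Long> countByCreatedBy(List<Map<String, Object>> connectorDocs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map<String, Object> doc : connectorDocs) {
            Object createdBy = doc.get("created_by");
            String key = createdBy instanceof String ? (String) createdBy : UNSET;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }
}
```

Each resulting entry would then correspond to one `CONNECTOR_COUNT` increment with the matching `created_by` tag.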

With these changes, a plugin provisioning ML resources via the ML Client simply sets the field on the input builder:
```java
MLCreateConnectorInput.builder()
    // ... existing fields ...
    .createdBy("my-plugin")
    .build();

MLRegisterModelInput.builder()
    // ... existing fields ...
    .createdBy("my-plugin")
    .build();

MLAgent.builder()
    // ... existing fields ...
    .createdBy("my-plugin")
    .build();
```

The stats framework then emits metrics like:

```
ml.commons.MODEL_COUNT{created_by="my-plugin", deployment="remote", service_provider="bedrock", type="llm", ...}
ml.commons.AGENT_COUNT{created_by="my-plugin", type="conversational", ...}
ml.commons.CONNECTOR_COUNT{created_by="my-plugin", service_provider="bedrock", ...}
```

**What alternatives have you considered?**

1. Using the existing `app_type` field on `MLAgent`: `MLAgent` already has an `appType` field, but it is a user-facing classification of the agent's functional purpose (e.g. "chatbot"), not a record of which plugin provisioned it. Overloading it for provenance would conflate two distinct concepts and would not cover connectors or models, which have no equivalent field.

2. Tagging via connector/model parameters: A plugin could embed a `created_by` key in the parameters map of a connector or model. However, this is an undocumented convention with no guarantee of surviving updates, no first-class support in `getTags()`, and no way to filter it out of functional parameters passed to the remote endpoint.

3. Tracking provenance outside ML Commons: The calling plugin could maintain its own index of resource IDs it provisioned and join that against ML Commons data at query time. This is fragile, requires the plugin to manage additional state, and produces metrics that are disconnected from the rich tag context (service provider, model type, etc.) that `MLStatsJobProcessor` already computes.

**Do you have any additional context?**

- `created_by` is purely informational metadata — a free-form string with no validation or enforcement by ML Commons. The framework does not need to know or care about the value.
- The field follows the exact version-gating pattern already used, ensuring backward compatibility in mixed-version clusters where older nodes simply ignore the field.
- `created_by` will be visible in `GET model/agent/connector` API responses, which is desirable for operator visibility into resource provenance. 
- This is not a security boundary. Any caller can set any value. It is not intended to replace or interact with the existing owner/user access control fields.

[FEATURE] created_by Provenance Tag Support in ML Commons #4752
