Add alerts with webhooks, CLI, and documentation #439

Merged
abidlabs merged 20 commits into main from alert on Mar 3, 2026

@abidlabs
Member

@abidlabs abidlabs commented Feb 24, 2026

Summary

Adds a complete alerts system to Trackio. Alerts let users flag important events during training runs — they're printed to the terminal, stored in the database, displayed in the dashboard, and optionally sent to webhooks.

image

In Slack (check the #trackio-alerts channel internally), an alert looks like this:

image
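
For readers who have not worked with Slack webhooks, here is a rough sketch of the kind of JSON body a Slack incoming webhook accepts: a plain-text fallback plus Block Kit blocks. The function name `build_slack_payload` and the exact layout are assumptions for illustration only; this is not the payload Trackio actually sends.

```python
import json


def build_slack_payload(title: str, text: str, level: str) -> dict:
    """Build a Slack incoming-webhook body (hypothetical layout, not Trackio's)."""
    header = f"[{level}] {title}"
    # A Block Kit "header" block must use plain_text.
    blocks = [{"type": "header", "text": {"type": "plain_text", "text": header}}]
    if text:
        # A "section" block can carry Slack-flavored markdown.
        blocks.append({"type": "section", "text": {"type": "mrkdwn", "text": text}})
    # The top-level "text" is the fallback shown in notifications.
    return {"text": header, "blocks": blocks}


payload = build_slack_payload("Loss spike", "Loss jumped to 7.31 at epoch 12", "ERROR")
body = json.dumps(payload)  # this string would be POSTed to the webhook URL
```

Sending is left out on purpose; any HTTP client that POSTs `body` with a JSON content type would do.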

Basic Usage

import trackio

trackio.init(project="my-project", webhook_url="https://hooks.slack.com/services/T.../B.../xxx")

for epoch in range(100):
    loss = train(...)
    trackio.log({"loss": loss})

    if epoch > 10 and loss > 5.0:
        trackio.alert(
            title="Loss spike",
            text=f"Loss jumped to {loss:.2f} at epoch {epoch}",
            level=trackio.AlertLevel.ERROR,
        )

trackio.finish()
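
One thing to watch with the loop above: once the loss stays above the threshold, the alert fires on every epoch. A minimal sketch of a cooldown guard follows. `AlertThrottle` is a hypothetical helper written for this example, not part of Trackio, and the loss values are made up.

```python
class AlertThrottle:
    """Allow an alert only if at least `cooldown` steps passed since the last one."""

    def __init__(self, cooldown: int):
        self.cooldown = cooldown
        self.last_step = None  # step of the most recent alert, if any

    def should_fire(self, step: int) -> bool:
        if self.last_step is None or step - self.last_step >= self.cooldown:
            self.last_step = step
            return True
        return False


throttle = AlertThrottle(cooldown=5)
# Made-up losses keyed by epoch, all above the 5.0 threshold.
losses = {11: 6.0, 12: 6.5, 13: 7.0, 16: 6.2, 17: 6.1}
fired = [
    epoch
    for epoch, loss in losses.items()
    if loss > 5.0 and throttle.should_fire(epoch)
]
print(fired)  # [11, 16]
```

In the training loop, the `trackio.alert(...)` call would sit behind `throttle.should_fire(epoch)`.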

Using with Transformers / TRL

When using report_to="trackio", the TrackioCallback handles init/log/finish. To add alerts, pass a custom callback:

import trackio
from transformers import Trainer, TrainerCallback, TrainingArguments

class AlertCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if "trackio" not in args.report_to or logs is None:
            return
        if logs.get("loss", 0) > 5.0:
            trackio.alert(
                title="Training loss spike",
                text=f"loss={logs['loss']:.4f} at step {state.global_step}",
                level=trackio.AlertLevel.ERROR,
            )

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if "trackio" not in args.report_to or metrics is None:
            return
        if metrics.get("eval_loss", 0) > 2.0:
            trackio.alert(
                title="High eval loss",
                text=f"eval_loss={metrics['eval_loss']:.4f}",
                level=trackio.AlertLevel.WARN,
            )

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./output",
        report_to="trackio",
        project="my-project",
    ),
    train_dataset=train_dataset,
    callbacks=[AlertCallback()],
)
trainer.train()

The same pattern works with TRL trainers (GRPOTrainer, SFTTrainer, etc.):

class RLAlertCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if "trackio" not in args.report_to or logs is None:
            return
        if logs.get("train/reward", 0) < -1.0:
            trackio.alert(title="Reward collapse", level=trackio.AlertLevel.ERROR)
        if logs.get("train/kl", 0) > 10.0:
            trackio.alert(title="KL divergence too high", level=trackio.AlertLevel.WARN)
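
If the list of checks grows, the thresholds can live in a table rather than a chain of `if` statements. The sketch below assumes nothing about Trackio beyond the calls shown above; `RULES` and `triggered_alerts` are hypothetical names, and the function only returns plain dicts so it can be tested without a trainer.

```python
import operator

# (metric key, comparison, threshold, alert title, level name)
RULES = [
    ("train/reward", operator.lt, -1.0, "Reward collapse", "ERROR"),
    ("train/kl", operator.gt, 10.0, "KL divergence too high", "WARN"),
]


def triggered_alerts(logs, rules=RULES):
    """Return one dict per rule whose metric is present and crosses its threshold."""
    alerts = []
    for key, compare, threshold, title, level in rules:
        value = logs.get(key)
        if value is not None and compare(value, threshold):
            alerts.append(
                {"title": title, "text": f"{key}={value:.4f}", "level": level}
            )
    return alerts


print(triggered_alerts({"train/reward": -2.0, "train/kl": 3.0}))
```

Inside `on_log`, each returned dict could then be forwarded to `trackio.alert`, mapping the level name to the matching `trackio.AlertLevel` member.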

This PR was authored with AI assistance, but I tested and reviewed it myself.

@gradio-pr-bot
Contributor

gradio-pr-bot commented Feb 24, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

Package Version
trackio minor

  • Add alerts with webhooks, CLI, and documentation

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@gradio-pr-bot
Contributor

gradio-pr-bot commented Feb 24, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

abidlabs and others added 3 commits February 24, 2026 09:56
Co-authored-by: Cursor <cursoragent@cursor.com>
@abidlabs abidlabs changed the title from "Add alerts" to "Add alerts with webhooks, CLI, and documentation" on Feb 24, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Collaborator

Looking forward to using it with Slack!!

@abidlabs
Member Author

just testing it rn @qgallouedec!

abidlabs and others added 5 commits February 24, 2026 12:04
Added details about Slack Block Kit messages in alerts documentation.
Removed an image and added a description of Slack Block Kit messages.
Co-authored-by: Cursor <cursoragent@cursor.com>
@abidlabs abidlabs marked this pull request as ready for review February 24, 2026 20:27
@abidlabs
Member Author

abidlabs commented Feb 25, 2026

Also added an agent skill (/.agents/skills/trackio/) so that LLM coding agents (Cursor, Claude Code, etc.) can automatically discover and use Trackio when running ML experiments.

This same skill is also published to hf-skills as hugging-face-trackio so it can be installed by any agent.

@abidlabs
Member Author

abidlabs commented Feb 25, 2026

Also added two capabilities for inspecting metrics at specific points in time — designed for the workflow where an alert fires and you (or an agent) need to quickly understand what happened.

trackio get metric now supports --step, --around, --at-time, and --window

Filter a single metric to a specific step or a window around a step/timestamp:

# Exact step
trackio get metric --project P --run R --metric loss --step 200 --json

# Window of ±10 steps (default) around step 200
trackio get metric --project P --run R --metric loss --around 200 --json

# Window of ±60 seconds around a timestamp
trackio get metric --project P --run R --metric loss --at-time "2025-06-01T12:05:30" --window 60 --json
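
To make the window semantics concrete, here is a small sketch of the filtering these flags describe, written against an in-memory list of points. It is not Trackio's implementation: the function names, the `timestamp` field, and the inclusive bounds are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def around_step(points, step, window=10):
    """Keep points whose step is within +/- `window` of `step` (inclusive, assumed)."""
    return [p for p in points if abs(p["step"] - step) <= window]


def around_time(points, at_time, window_seconds=60):
    """Keep points logged within +/- `window_seconds` of an ISO 8601 timestamp."""
    center = datetime.fromisoformat(at_time)
    delta = timedelta(seconds=window_seconds)
    return [
        p for p in points
        if abs(datetime.fromisoformat(p["timestamp"]) - center) <= delta
    ]


# Made-up sample points; the "timestamp" field is assumed for illustration.
points = [
    {"step": 180, "value": 0.50, "timestamp": "2025-06-01T12:03:00"},
    {"step": 195, "value": 0.44, "timestamp": "2025-06-01T12:05:00"},
    {"step": 200, "value": 0.45, "timestamp": "2025-06-01T12:05:30"},
    {"step": 210, "value": 0.43, "timestamp": "2025-06-01T12:06:30"},
    {"step": 220, "value": 0.41, "timestamp": "2025-06-01T12:08:00"},
]

by_step = [p["step"] for p in around_step(points, 200)]
by_time = [p["step"] for p in around_time(points, "2025-06-01T12:05:30", 30)]
print(by_step, by_time)  # [195, 200, 210] [195, 200]
```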

trackio get snapshot

Returns all metrics at/around a step or timestamp in a single call. This is the fastest way to understand the full state of a run at a specific point:

trackio get snapshot --project P --run R --around 200 --window 5 --json

Returns:

{
  "project": "P",
  "run": "R",
  "around": 200,
  "window": 5,
  "metrics": {
    "loss": [{"step": 198, "value": 0.42}, {"step": 200, "value": 0.45}, ...],
    "accuracy": [{"step": 198, "value": 0.88}, {"step": 200, "value": 0.87}, ...],
    "lr": [{"step": 198, "value": 0.0001}, {"step": 200, "value": 0.0001}, ...]
  }
}

The typical agent workflow is: see alert at step N → inspect metrics around step N → decide to continue or adjust. Previously, the agent would need to fetch the entire metric history and filter client-side. Now it's a single CLI call.
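
As an illustration of the "inspect metrics around step N" part, the sketch below reduces a snapshot to one line per metric: the value nearest the requested step and the change across the window. It assumes the JSON shape shown above and uses only the two points per metric that the example spells out; `summarize` is a hypothetical helper, not a Trackio function.

```python
import json

# Truncated sample mirroring the snapshot shape above (two points per metric).
snapshot_json = """
{
  "project": "P",
  "run": "R",
  "around": 200,
  "window": 5,
  "metrics": {
    "loss": [{"step": 198, "value": 0.42}, {"step": 200, "value": 0.45}],
    "accuracy": [{"step": 198, "value": 0.88}, {"step": 200, "value": 0.87}],
    "lr": [{"step": 198, "value": 0.0001}, {"step": 200, "value": 0.0001}]
  }
}
"""


def summarize(snapshot):
    """Per metric: the point nearest `around` and the change across the window."""
    center = snapshot["around"]
    summary = {}
    for name, points in snapshot["metrics"].items():
        if not points:
            continue  # nothing was logged for this metric in the window
        ordered = sorted(points, key=lambda p: p["step"])
        nearest = min(ordered, key=lambda p: abs(p["step"] - center))
        summary[name] = {
            "nearest_step": nearest["step"],
            "value": nearest["value"],
            # Last value minus first value, rounded to hide float noise.
            "change": round(ordered[-1]["value"] - ordered[0]["value"], 6),
        }
    return summary


summary = summarize(json.loads(snapshot_json))
for name, info in summary.items():
    print(name, info)
```

An agent could feed the output of the snapshot command into a helper like this and decide from the `change` values whether the run needs attention.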

@qgallouedec
Collaborator

Do you think we should also call gr.Warning? At first I was a bit surprised that nothing was showing on my dashboard.

@qgallouedec
Collaborator

Another question: did you consider having a dedicated “panel” within the Metrics view instead of creating a new tab? I’m wondering if switching tabs back and forth might become cumbersome when monitoring a run (i.e., looking at the curves while also keeping an eye on the latest alerts).

@abidlabs
Member Author

abidlabs commented Feb 26, 2026

Do you think we should also call gr.Warning? At first I was a bit surprised that nothing was showing on my dashboard.

Another question: did you consider having a dedicated “panel” within the Metrics view instead of creating a new tab? I’m wondering if switching tabs back and forth might become cumbersome when monitoring a run (i.e., looking at the curves while also keeping an eye on the latest alerts).

Great feedback @qgallouedec! I've redesigned the UI in the Trackio dashboard, replacing the dedicated Alerts page with an Alerts box that appears on the bottom right of every page:

image

This box can be expanded to view the latest alerts or collapsed. (It only appears if there is at least one alert.) Let me know what you think!

This box will show the alerts that have been generated since you launched the Trackio dashboard. You can also view the historical alerts by going to the Reports page.

@qgallouedec
Collaborator

Awesome! I think it's a better design indeed. I will try it again today.

@abidlabs
Member Author

abidlabs commented Mar 3, 2026

I'll go ahead and merge this and release tomorrow. If you have time to take a look before then, @qgallouedec, great, but no pressure if not.

@abidlabs abidlabs merged commit 18e9650 into main Mar 3, 2026
8 checks passed

5 participants