Prometheus & Grafana Observability

This directory provides comprehensive guides for deploying observability stacks to monitor GPU workloads across different AWS compute platforms.

Available Solutions

EKS Managed Observability

Monitor Amazon EKS GPU workloads using Amazon Managed Prometheus and Amazon Managed Grafana with ADOT Collector. This solution provides minimal in-cluster overhead and supports multi-cluster monitoring.

Best for: EKS clusters, multi-cluster monitoring, minimal resource overhead

SageMaker HyperPod Monitoring (This Document)

Monitor SageMaker HyperPod clusters with SLURM-exporter, DCGM-exporter, and EFA-node-exporter. This solution uses lifecycle scripts to bootstrap monitoring on HyperPod nodes.

Best for: SageMaker HyperPod with SLURM workload manager

SageMaker HyperPod Monitoring

This guide provides a comprehensive approach for deploying an observability stack tailored to enhance monitoring capabilities for your SageMaker HyperPod cluster. It demonstrates how to export both cluster metrics (SLURM-exporter) and node metrics (DCGM-exporter, EFA-node-exporter) to a Prometheus/Grafana monitoring stack. This setup enables your administrators, ML-ops teams, and model developers to access real-time metrics, offering valuable insights into your cluster's performance.

To get started, you will provision an AWS CloudFormation stack in your AWS account. You can find the complete stack template in cluster-observability.yaml. This CloudFormation stack deploys the resources dedicated to cluster monitoring in your AWS environment, including the Amazon Managed Prometheus workspace and the Amazon Managed Grafana workspace used later in this guide.

If your environment does not allow the use of IAM Identity Center or SAML, consider the alternative open source (OS) Grafana option (see README-OS-grafana.md).

The solution uses SageMaker HyperPod Lifecycle Scripts to bootstrap your cluster with the following open-source exporter services:

Name	Script Deployment Target	Metrics Description
`0.Prometheus Slurm Exporter`	controller-node	SLURM Accounting metrics (sinfo, sacct)
`1.EFA-Node-Exporter`	cluster-nodes	Fork of Node Exporter that adds metrics emitted by EFA
`2.NVIDIA-DCGM-Exporter`	cluster-nodes	NVIDIA DCGM metrics for NVIDIA GPUs

Prerequisites

To enable these exporter services, set enable_observability = True in the config.py file. Save the file and upload it to the S3 bucket path referenced in your cluster-config.json file. Modifying config.py and uploading it to S3 ensures that any nodes later added to or replaced in the HyperPod cluster are also created with the metric exporter scripts running.
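
The edit and upload described above can be sketched as two small shell helpers. The file name config.py and the enable_observability flag come from this guide; the function names, the S3 URI placeholder, and the assumption that the flag sits on its own line as `enable_observability = <value>` are illustrative only, so check them against your own lifecycle scripts.

```shell
#!/usr/bin/env bash
# Sketch only: flip enable_observability to True in a local copy of config.py.
# Assumes the flag is written on its own line as "enable_observability = <value>".
enable_observability() {
  local config_file="$1"
  # Rewrite the assignment in place and keep a .bak copy of the original file.
  sed -i.bak -E \
    's/^([[:space:]]*enable_observability[[:space:]]*=[[:space:]]*).*/\1True/' \
    "$config_file"
}

# Upload the edited file. The destination is a placeholder, not a real path:
# use the lifecycle-script location referenced in your cluster-config.json.
upload_config() {
  local config_file="$1" s3_uri="$2"
  aws s3 cp "$config_file" "$s3_uri"
}
```

For example, `enable_observability config.py` followed by `upload_config config.py s3://<your-bucket>/<your-prefix>/config.py`, with the bucket and prefix replaced by your own values.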

If you have already created your HyperPod cluster, you can follow these instructions to update your existing HyperPod cluster with Observability.

Important

Before proceeding, you will need to add the following AWS Managed IAM Policies to your AmazonSagemakerClusterExecutionRole:

AmazonPrometheusRemoteWriteAccess: this gives the controller node access to write cluster metrics to the Amazon Managed Prometheus workspace you will create.
AWSCloudFormationReadOnlyAccess: this gives the install_prometheus.sh script permission to read stack outputs (remotewriteurl, region) from your CloudFormation stack.
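
If you prefer the AWS CLI to the console, attaching the two policies could look like the sketch below. It assumes the standard `aws` partition for managed policy ARNs and that you pass in the actual name of your execution role; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: attach the two AWS managed policies named above to a role.

# Build the ARN of an AWS managed policy (assumes the standard "aws" partition).
managed_policy_arn() {
  printf 'arn:aws:iam::aws:policy/%s\n' "$1"
}

# Attach both policies to the given role, e.g. your cluster execution role.
attach_observability_policies() {
  local role_name="$1"
  local policy
  for policy in AmazonPrometheusRemoteWriteAccess AWSCloudFormationReadOnlyAccess; do
    aws iam attach-role-policy \
      --role-name "$role_name" \
      --policy-arn "$(managed_policy_arn "$policy")"
  done
}
```

Calling `attach_observability_policies <your-role-name>` requires credentials that are allowed to modify IAM roles.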

Deploy the CloudFormation Stack

1-Click Deploy

Alternatively, you can deploy the OS Grafana stack.

1-Click Deploy

Important

It is strongly recommended that you deploy this stack into the same region and same account as your SageMaker HyperPod cluster. This ensures successful execution of the Lifecycle Scripts, specifically install_prometheus.sh, which relies on AWS CLI commands that assume the same account and region.

Connect to the cluster

Connect to the controller node of your cluster via SSM:

Note

You can find the ClusterID, WorkerGroup, and Instance ID of your controller node in the SageMaker console or via the AWS CLI.

aws ssm start-session --target sagemaker-cluster:<CLUSTER_ID>_<WORKER_GROUP>-<INSTANCE_ID>
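
Because the target string is easy to mistype (an underscore after the cluster ID, then a hyphen before the instance ID), a small helper can assemble it. This is a sketch; the function names and the sample IDs in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assemble the SSM target string from its three parts.
ssm_target() {
  local cluster_id="$1" worker_group="$2" instance_id="$3"
  # Format: sagemaker-cluster:<CLUSTER_ID>_<WORKER_GROUP>-<INSTANCE_ID>
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$worker_group" "$instance_id"
}

# Open a session on the node identified by the three parts.
connect_to_node() {
  aws ssm start-session --target "$(ssm_target "$1" "$2" "$3")"
}
```

For example, `connect_to_node abc123 controller-group i-0123456789abcdef0` would start a session against `sagemaker-cluster:abc123_controller-group-i-0123456789abcdef0`, using made-up values for illustration.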

Verify that the new Prometheus configuration and service file created by install_prometheus.sh are in place and that the service is running on the controller node:

sudo systemctl status prometheus

The output should show active (running).

You can validate the prometheus configuration file with:

cat /etc/prometheus/prometheus.yml

Your file should look similar to the following:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

scrape_configs:
  - job_name: 'slurm_exporter'
    static_configs:
      - targets:
          - 'localhost:8080'
  - job_name: 'dcgm_exporter'
    static_configs:
      - targets:
          - '<ComputeNodeIP>:9400'
          - '<ComputeNodeIP>:9400'
  - job_name: 'efa_node_exporter'
    static_configs:
      - targets:
          - '<ComputeNodeIP>:9100'
          - '<ComputeNodeIP>:9100'

remote_write:
  - url: <AMPRemoteWriteURL>
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
    sigv4:
      region: <Region>

You can curl for relevant Prometheus metrics on the controller node using:

curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
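
As a complementary check, you can count sample lines per exporter instead of reading the raw output. The helper below is a sketch with a hypothetical name; it matches case-insensitively because DCGM metric names such as DCGM_FI_DEV_GPU_UTIL are uppercase, and it skips the `# HELP` and `# TYPE` comment lines of the Prometheus text format.

```shell
#!/usr/bin/env bash
# Sketch only: count metric sample lines on stdin that mention a keyword.
count_metric_lines() {
  local keyword="$1"
  # Drop comment lines, then count case-insensitive matches of the keyword.
  grep -v '^#' | grep -ci "$keyword"
}
```

For example, `curl -s http://localhost:8080/metrics | count_metric_lines slurm` counts samples from the Slurm exporter endpoint listed in the configuration above. Prometheus also reports the health of each scrape target through its HTTP API at `http://localhost:9090/api/v1/targets`.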

With node and cluster metrics now being exported to the Amazon Managed Prometheus workspace via Prometheus remote write from the controller node, you will next set up the Amazon Managed Grafana workspace.

Setup the Grafana Workspace

Important

Before proceeding, ensure your AWS account has been set up with AWS IAM Identity Center. It will be used to authenticate to the Amazon Managed Grafana workspace in the final steps.

Navigate to Amazon Managed Grafana in the AWS Management Console.

In the Authentication tab, configure authentication using AWS IAM Identity Center.

Note

Configure your AWS IAM Identity Center User as User type: Admin.

Within the DataSources tab of your Grafana workspace, click the "Configure in Grafana" link to configure Prometheus as a data source.

You will be prompted to authenticate to the Grafana workspace with the IAM Identity Center Username and Password. This is the user you set up for the workspace.

Note

If you have forgotten your username and password, you can find and reset them within IAM Identity Center.

Once you are on the Amazon Managed Grafana workspace "datasources" page, select the AWS Region and the workspace ID of your Amazon Managed Prometheus workspace.

Build Grafana Dashboards

Finally, with authentication and data sources set up, select dashboards > new > import within your Grafana workspace.

To display metrics for the exporter services, you can start by importing and customizing the following open source Grafana dashboards, pasting each dashboard's link into the import dialog:

Slurm Exporter Dashboard:

Node Exporter Dashboard:

DCGM Exporter Dashboard:

GPU Health (Xid) Dashboard:

GPU Health (Xid) by Node Dashboard:

Congratulations, you can now view real-time metrics about your SageMaker HyperPod cluster and compute nodes in Grafana!
