NVIDIA Bright Cluster Manager

Introduction

NVIDIA Bright Cluster Manager is designed to enable rapid deployment and end-to-end lifecycle management of heterogeneous high-performance computing (HPC) and artificial intelligence (AI) infrastructures. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands, supports CPU-based and NVIDIA GPU-accelerated systems, and enables orchestration with Kubernetes.

Device types in Bright Cluster Manager:

A device in the cluster manager infrastructure represents a component of a cluster. A device can be any of the following types:

Head Node
Physical Node
Virtual Node
Cloud Node
GPU Unit
Chassis
Ethernet Switch
InfiniBand Switch
Lite Node
Myrinet Switch
Power Distribution Unit
Rack Sensor Kit
Generic Device
Prism Central (PC)

The following topics help you get started:

To explore monitored metrics, see Supported Metrics and Default Monitoring Configuration.
To verify prerequisites and configure, see Working with NVIDIA Bright Cluster Manager.

Key Use Cases

Discovery Use Cases

Discovers BCM elements.
Publishes relationships between resources to provide a topological view and ease maintenance.
For the detailed structure, refer to Resource Hierarchy.

Monitoring Use Cases

Provides monitoring related to availability, performance, capacity, thermal, and usage metrics.
Generates alerts when defined metric thresholds are breached, notifying users of potential issues.
Device monitoring collects time-based metric values and sends alerts to the designated customer team whenever thresholds are breached or unusual metric behavior is detected, as defined in the configurations. This process enables the customer to respond promptly and ensures smooth business operations with minimal or no downtime in the event of infrastructure-related issues.
For hierarchy details, refer to Resource Hierarchy.
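
The threshold-based alerting described above can be illustrated with a minimal sketch. This is not the application's implementation: the `Threshold` and `Sample` types, the `evaluate` and `alerts_for` helpers, the resource name, and the threshold values are all hypothetical and chosen only for illustration. The metric name is taken from the Version History table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Threshold:
    """Hypothetical warning/critical limits for one metric."""
    warning: float
    critical: float


@dataclass
class Sample:
    """One time-based metric value collected from a resource."""
    resource: str
    metric: str
    timestamp: int  # Unix time in seconds
    value: float


def evaluate(sample: Sample, threshold: Threshold) -> Optional[str]:
    """Return an alert severity if the sample breaches a limit, else None."""
    if sample.value >= threshold.critical:
        return "Critical"
    if sample.value >= threshold.warning:
        return "Warning"
    return None


def alerts_for(samples: List[Sample], thresholds: Dict[str, Threshold]) -> List[str]:
    """Check each sample against the threshold configured for its metric."""
    alerts = []
    for sample in samples:
        threshold = thresholds.get(sample.metric)
        if threshold is None:
            continue  # no threshold defined for this metric, so no alert
        severity = evaluate(sample, threshold)
        if severity is not None:
            alerts.append(
                f"{severity}: {sample.metric} on {sample.resource} = {sample.value}"
            )
    return alerts


# Assumed example values; real thresholds come from the monitoring configuration.
METRIC = "nvidia_bcm_cluster_smartHdaTemp"
example_thresholds = {METRIC: Threshold(warning=50.0, critical=60.0)}
example_samples = [
    Sample("head-node-01", METRIC, 1700000000, 41.0),
    Sample("head-node-01", METRIC, 1700000060, 52.0),
    Sample("head-node-01", METRIC, 1700000120, 63.0),
]

for alert in alerts_for(example_samples, example_thresholds):
    print(alert)
```

In this sketch, a sample below the warning limit produces no alert, and metrics without a configured threshold are skipped rather than treated as errors.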

Resource Hierarchy

NVIDIA BCM Head Node
NVIDIA BCM Virtual Node
NVIDIA BCM Physical Node
NVIDIA BCM Linux Server

Version History


Application Version	Bug fixes / Enhancements
5.1.0	Introduced third-party CI alert mapping and OpsQL-based resource filtering. Alert mappings to target CI systems are configured in the application configuration. Once configured, alerts are automatically forwarded to the corresponding third-party platforms and mapped to the specified CIs. Resource filters in the app configuration previously required manual entry of resource core and custom attributes; they now use OpsQL-based filtering, with keys auto-populated as needed.
5.0.7	Provided a fix for fetching "Get Latest Metrics".
5.0.6	Added support for adding the Root Resource UUID as a custom attribute for the NVIDIA Bright Cluster Manager app.
5.0.5	Activity Log, Get Latest Metrics, and debugging changes for NVIDIA Bright Cluster Manager.
5.0.4	Provided a fix for component-level threshold alerting.
5.0.3	Fixed enabling and disabling of component-level threshold alerts.
5.0.2	Resource Display Order changes and Sub-Category changes.

Earlier Versions


Application Version	Bug fixes / Enhancements
5.0.1	Curated dashboard and cache flush changes.
5.0.0	Added the new native type NVIDIA BCM Linux Server and its metrics.
4.0.0	Added support for nfs-server metrics on NVIDIA BCM Head Node.
3.0.0	Added support for nfs-mount metrics on NVIDIA BCM Physical Node.
2.0.0	Added support for the new metric "nvidia_bcm_cluster_smartHdaTemp".
1.0.0	Initial Support.