Introduction
NVIDIA Bright Cluster Manager is designed to enable rapid deployment and end-to-end lifecycle management of heterogeneous high-performance computing (HPC) and artificial intelligence (AI) infrastructures. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands, supports CPU-based and NVIDIA GPU-accelerated systems, and enables orchestration with Kubernetes.
Device types in Bright Cluster Manager:
A device in the cluster manager infrastructure represents a component of a cluster. A device can be any of the following types:
- Head Node
- Physical Node
- Virtual Node
- Cloud Node
- GPU Unit
- Chassis
- Ethernet Switch
- InfiniBand Switch
- Lite Node
- Myrinet Switch
- Power Distribution Unit
- Rack Sensor Kit
- Generic Device
- Prism Central (PC)
The following links help you get started:
To explore monitored metrics, see Supported Metrics and Default Monitoring Configuration.
To verify prerequisites and configure, see Working with NVIDIA Bright Cluster Manager.
Key Use Cases
Discovery Use Cases
- Discovers BCM elements.
- Publishes relationships between resources to provide a topological view and ease maintenance.
- Refer to the Resource Hierarchy for detailed structure.
Monitoring Use Cases
- Provides monitoring of availability, performance, capacity, thermal conditions, and usage.
- Generates alerts when defined metric thresholds are breached, notifying users of potential issues.
- Collects time-based metric values for each device and alerts the designated customer team when a threshold is breached or unusual metric behavior is detected, as defined in the configuration. This lets the customer respond promptly to infrastructure-related issues and keep business operations running with minimal or no downtime.
- For hierarchy details, refer to Resource Hierarchy.
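The threshold-based alerting described above can be illustrated with a minimal sketch. It is not the application's implementation. The `Threshold`, `Sample`, and `evaluate` names, the resource name `head-node-01`, and the warning and critical limits are all hypothetical. Only the metric name `nvidia_bcm_cluster_smartHdaTemp` comes from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical warning/critical limits for one metric."""
    warning: float
    critical: float


@dataclass
class Sample:
    """One time-stamped metric value collected from a device."""
    resource: str
    metric: str
    timestamp: int  # Unix epoch seconds
    value: float


def evaluate(sample: Sample, threshold: Threshold) -> Optional[str]:
    """Return the alert severity for a sample, or None if it is within limits."""
    if sample.value >= threshold.critical:
        return "critical"
    if sample.value >= threshold.warning:
        return "warning"
    return None


# Illustrative limits only; real thresholds come from the monitoring configuration.
thresholds = {
    "nvidia_bcm_cluster_smartHdaTemp": Threshold(warning=50.0, critical=60.0),
}

# Assumed sample data: three readings taken five minutes apart.
samples = [
    Sample("head-node-01", "nvidia_bcm_cluster_smartHdaTemp", 1700000000, 42.0),
    Sample("head-node-01", "nvidia_bcm_cluster_smartHdaTemp", 1700000300, 55.0),
    Sample("head-node-01", "nvidia_bcm_cluster_smartHdaTemp", 1700000600, 63.0),
]

# Keep only the samples that breach a threshold, paired with their severity.
alerts = []
for s in samples:
    severity = evaluate(s, thresholds[s.metric])
    if severity is not None:
        alerts.append((s.timestamp, severity))
```

In this sketch the first reading raises no alert, while the second and third produce a warning and a critical alert respectively. Detection of unusual metric behavior, which the monitoring also covers, is out of scope here.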
Resource Hierarchy
- NVIDIA Bright Cluster Manager
- NVIDIA BCM Head Node
- NVIDIA BCM Virtual Node
- NVIDIA BCM Physical Node
- NVIDIA BCM Linux Server
Version History
| Application Version | Bug fixes / Enhancements |
|---|---|
| 5.0.7 | Fixed fetching of “Get Latest Metrics”. |
| 5.0.6 | Added support for adding the Root Resource UUID as a custom attribute for the NVIDIA Bright Cluster Manager app. |
| 5.0.5 | Activity Log, Get Latest Metrics, and debugging changes for NVIDIA Bright Cluster Manager. |
| 5.0.4 | Fixed component-level threshold alerting. |
| 5.0.3 | Fixed enabling and disabling of component-level threshold alerts. |
| 5.0.2 | Resource Display Order changes and Sub-Category changes. |
| 5.0.1 | Curated dashboard and cache flush changes. |
| 5.0.0 | Added new Native Type NVIDIA BCM Linux Server and Metrics. |
| 4.0.0 | Added support for nfs-server metrics on NVIDIA BCM Head Node. |
| 3.0.0 | Added support for nfs-mount metrics on NVIDIA BCM Physical Node. |
| 2.0.0 | Supported new metric "nvidia_bcm_cluster_smartHdaTemp". |
| 1.0.0 | Initial Support. |