Introduction

NVIDIA Bright Cluster Manager is designed to enable rapid deployment and end-to-end lifecycle management of heterogeneous high-performance computing (HPC) and artificial intelligence (AI) infrastructures. It automates provisioning and administration for clusters ranging in size from a couple of nodes to hundreds of thousands, supports CPU-based and NVIDIA GPU-accelerated systems, and enables orchestration with Kubernetes.

Device types in Bright Cluster Manager:

A device in the cluster manager infrastructure represents a component of a cluster. A device can be any of the following types:

  • Head Node
  • Physical Node
  • Virtual Node
  • Cloud Node
  • GPU Unit
  • Chassis
  • Ethernet Switch
  • InfiniBand Switch
  • Lite Node
  • Myrinet Switch
  • Power Distribution Unit
  • Rack Sensor Kit
  • Generic Device
  • Prism Central (PC)

The following sections help you get started.

Key Use Cases

Discovery Use Cases

  • Discovers BCM elements.
  • Publishes relationships between resources to provide a topological view and simplify maintenance.
  • Refer to the Resource Hierarchy for detailed structure.

Monitoring Use Cases

  • Provides monitoring of availability, performance, capacity, thermal, and usage metrics.
  • Generates alerts when defined metric thresholds are breached, notifying users of potential issues.
  • Device monitoring collects time-based metric values and sends alerts to the designated customer team whenever a threshold is breached or unusual metric behavior is detected, as defined in the configurations. This enables the customer to respond promptly and helps keep business operations running with minimal or no downtime when infrastructure-related issues occur.
  • For hierarchy details, refer to Resource Hierarchy.
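
The threshold-based alerting described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the Threshold class, the evaluate function, the warning/critical limits, and the sample values are all hypothetical and exist only to show the idea of comparing time-based metric values against configured thresholds. The metric name nvidia_bcm_cluster_smartHdaTemp is taken from the Version History; the limits assigned to it here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical warning/critical limits for one metric."""
    warning: float
    critical: float


def evaluate(value: float, threshold: Threshold) -> Optional[str]:
    """Return the alert severity for a sample, or None if it is within limits."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return None


# Illustrative limits only; real values would come from the monitoring configuration.
thresholds = {
    "nvidia_bcm_cluster_smartHdaTemp": Threshold(warning=50.0, critical=60.0),
}

# Time-ordered samples as (timestamp, metric name, value); the data is made up.
samples = [
    (1700000000, "nvidia_bcm_cluster_smartHdaTemp", 42.0),
    (1700000060, "nvidia_bcm_cluster_smartHdaTemp", 55.0),
    (1700000120, "nvidia_bcm_cluster_smartHdaTemp", 63.0),
]

# Collect one alert per sample that breaches a threshold.
alerts = []
for timestamp, metric, value in samples:
    severity = evaluate(value, thresholds[metric])
    if severity is not None:
        alerts.append((timestamp, metric, severity))

for alert in alerts:
    print(alert)
```

In this sketch the first sample stays below both limits and raises nothing, while the second and third produce a warning and a critical alert respectively. Detection of unusual metric behavior, which the monitoring also covers, is outside the scope of this example.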

Resource Hierarchy

    NVIDIA Bright Cluster Manager
    • NVIDIA BCM Head Node
    • NVIDIA BCM Virtual Node
    • NVIDIA BCM Physical Node
    • NVIDIA BCM Linux Server

Version History

Application Version    Bug fixes / Enhancements
5.0.7                  Provided a fix to fetch “Get Latest Metrics”.
5.0.6                  Support adding the Root Resource UUID as a custom attribute for Nvidia Bright Cluster Manager app.
5.0.5                  Activity Log, Get Latest Metrics and Debugging Changes for Nvidia Bright Cluster Manager.
5.0.4                  Fix provided related to component level threshold alerting.
5.0.3                  Fixed component level threshold alert enabling and disabling.
5.0.2                  Resource Display Order changes and Sub-Category changes.
5.0.1                  Curated DashBoard, cache flush changes.
5.0.0                  Added new Native Type NVIDIA BCM Linux Server and Metrics.
4.0.0                  Added support for nfs-server metrics on NVIDIA BCM Head Node.
3.0.0                  Added support for nfs-mount metrics on NVIDIA BCM Physical Node
2.0.0                  Supported new metric "nvidia_bcm_cluster_smartHdaTemp".
1.0.0                  Initial Support.