This document gives a basic overview of platform concepts and architectural views.

The Concepts section introduces the vocabulary, key principles, and organizing concepts that guide your understanding of the workflow and business processes the system supports.

The Component model section shows the logical relationship of functional elements.

The Process model section gives a high-level, user perspective of the basic operations involved in enterprise management.

The Deployment model gives a physical representation of deployed components and features.

The documentation information architecture emulates the Component model. For each platform- and solution-layer component, the documentation provides both reference information, such as detailed metrics descriptions, and related user interface guides. Future editions will focus on operational workflows, including tutorials, how-tos, and best practices.

Concepts

Access controls

Access controls provide a mechanism for authorizing user access to the platform and involve:

  • Authenticating users using one of the following authentication options:

    • Native user management and authentication
    • Single Sign-On (SSO)
    • Two-factor authentication
  • Role-based Access Control (RBAC) to grant users permissions based on their assigned role.

Using RBAC, you can control which actions an authenticated user is permitted, restricting access to:

  • Resources a user can manage, such as network resources only.
  • Credentials to which a user has access, such as non-administrator credentials on servers.
  • The actions a user is permitted to take, such as accessing remote consoles.
  • The locations, or domains, from which a user is permitted access to the platform.

Agents and gateways

Agents and gateways are distributed platform components that discover and monitor infrastructure resources. Agents monitor servers and applications, and gateways monitor non-server devices, such as network and storage devices.

An agent is an executable application that runs on managed resources within both on-premise and cloud infrastructures. The agent provides:

  • Hybrid Infrastructure Discovery for

    • on-premise data center
    • private cloud
    • virtualization
    • public clouds
    • synthetics
    • network
    • storage
  • Granular asset information

  • In-depth monitoring

  • OS and performance monitoring

  • Application monitoring

  • Availability monitoring

  • Dashboards

  • Custom Monitors

  • Auditable remote access using RDP and SSH

  • Patch management

  • Run book automation scripting

  • Scheduled jobs

  • Offline data storage

  • SNMP trap receiver

  • Report data

A gateway is a virtual appliance that provides secure communication, non-server resource monitoring, and limited data storage in the event of connectivity failure, including:

  • Hybrid Infrastructure Discovery for

    • on-premise data center
    • private cloud
    • virtualization
    • public clouds
    • synthetics
    • network
    • storage
  • Granular asset information

  • In-depth monitoring

  • OS and performance monitoring

  • Application monitoring

  • Availability monitoring

  • Synthetics with URL monitoring

  • Dashboards

  • Ping check

  • Custom monitors

  • Auditable remote access using RDP and SSH

  • Patch management

  • Scheduled jobs

  • Network configuration backup

  • Offline data storage

  • API adapters for vCenter and storage

  • SMI-S adapters for storage

  • SNMP trap receiver

  • Integrations with third-party tools

  • Report data

Automation

You can use automation to act on resource faults, remediate issues in response to events, or perform routine maintenance tasks. There are two automation models:

Automate discrete tasks: Use this model for a single task that needs to be executed on multiple servers.
Automate a sequence of tasks: Use this model to execute a sequence of tasks across multiple resources. This model is called a process automation workflow.
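
The difference between the two models can be sketched as follows. This is only an illustration: the function names, the task signature, and the example tasks are assumptions and are not part of the platform.

```python
from typing import Callable, Dict, List

# Hypothetical task type: takes a resource name and returns a result string.
Task = Callable[[str], str]


def run_discrete_task(task: Task, servers: List[str]) -> Dict[str, str]:
    """Discrete model: run one task on each of several servers."""
    return {server: task(server) for server in servers}


def run_workflow(steps: List[Task], resources: List[str]) -> List[str]:
    """Sequence model: run ordered steps, each across all resources."""
    log = []
    for step in steps:
        for resource in resources:
            log.append(step(resource))
    return log


def restart_service(resource: str) -> str:
    return f"restarted service on {resource}"


def verify_health(resource: str) -> str:
    return f"verified health of {resource}"
```

Calling `run_discrete_task(restart_service, ["web-1", "web-2"])` applies the single task to both servers, while `run_workflow([restart_service, verify_health], ["db-1"])` runs the steps in order.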

Availability

An up/down state indicates resource availability for providing the prescribed service. The up/down state of a resource can be determined by evaluating metrics or by a simple acknowledgment from the resource.

Correlation

Alerts that can be inferred to be due to the same cause are automatically grouped. Two types of alert correlation are performed:

Deduplication: Repeated alerting occurs for an alert that is currently unresolved, such as network devices sending SNMP traps for as long as an issue persists. Repeated alerts are deduplicated.
Inferencing: Different alerts originate from different IT resources, but the platform can infer that the alerts are likely due to the same cause.
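
Deduplication can be illustrated with a minimal sketch. The `Alert` fields and the choice of deduplication key (resource, metric, and state) are assumptions made for the example. The platform's actual matching rules are not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alert:
    resource: str
    metric: str
    state: str  # for example "critical" or "warning"


def deduplicate(alerts: List[Alert]) -> Dict[Tuple[str, str, str], int]:
    """Collapse repeated alerts into one entry with a repeat count."""
    counts: Dict[Tuple[str, str, str], int] = {}
    for alert in alerts:
        # Assumed key: alerts that agree on all three fields are repeats.
        key = (alert.resource, alert.metric, alert.state)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

With this key, three identical traps from the same switch collapse into a single entry with a count of three.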

Dashboard

A dashboard is a collection of widgets that provide visualizations of collected metrics.

Partner-scoped dashboards are visible only to users defined for the partner. Client-scoped dashboards are only visible to users who are client members.

Discovery

Discovery is the process of finding resources deployed in the enterprise. Resources need to be discovered before they can be monitored and metrics collected. When discovering resources, a model that includes all resources is dynamically built and is used to interpret and present the state of the environment.

Event management

Events are activities of operational significance that occur on a monitored resource. Examples of events include:

  • Hardware failures
  • Server CPU utilization thresholds exceeded
  • Application failures
  • Configuration changes

The following mechanisms are used to detect events:

  • Native instrumentation
  • Self-diagnostics
  • Third-party reporting by integrated third-party monitors

The goal of event management is to minimize the time spent responding to an event. The following event management lifecycle standardizes and automates the efficient handling of events:

  1. Ingestion
  2. Interpretation
  3. Correlation
  4. First Response
  5. Escalation

First response

The initial alert response can be governed by:

  • Inferred seasonal patterns, so the alert might be automatically suppressed if it remains open past a historical norm.
  • Learning algorithms, which can be trained to suppress alerts that match specific patterns.

Metric threshold

Metrics can be evaluated against threshold limits. Two types of thresholds are supported. A static threshold is a fixed value that represents a fault condition when exceeded. A change-based threshold is a value computed automatically that measures unexpected changes in the metric value. Change-based thresholds are more applicable to metrics where a static value is difficult to determine.
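
The two threshold types can be contrasted in a short sketch. The static check follows the description above. The change-based check is an assumption: it flags a value that deviates from the recent mean by more than `k` sample standard deviations, which is one common way to detect unexpected change and not necessarily how the platform computes it.

```python
from statistics import mean, stdev
from typing import List


def breaches_static(value: float, limit: float) -> bool:
    """Static threshold: fault when the value exceeds a fixed limit."""
    return value > limit


def breaches_change_based(history: List[float], value: float, k: float = 3.0) -> bool:
    """Change-based threshold (assumed form): fault when the value deviates
    from the recent mean by more than k sample standard deviations."""
    if len(history) < 2:
        return False  # not enough data to compute a baseline
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > k * spread
```

For a metric that normally hovers around 50, a reading of 80 breaches the change-based check even if no one has decided what a sensible fixed limit would be.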

Monitoring

The goal of monitoring is to assess the availability and performance of managed resources. This is done by collecting, storing, and evaluating resource metrics.

Performance

Resource performance is the measure of whether the resource is operating within user-defined limits. Fault conditions such as exceeding predefined thresholds can indicate performance issues.

Service maps

Service maps organize resources into a hierarchical structure. This makes it possible to associate resource health with the level of user and business impact.
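
A hierarchical health roll-up can be sketched as follows. The class name, the three health states, and the rule that a parent takes the worst health of its descendants are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed ordering of health states, from best to worst.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class ServiceNode:
    name: str
    health: str = "ok"  # own health; meaningful mainly for leaf resources
    children: List["ServiceNode"] = field(default_factory=list)

    def rolled_up_health(self) -> str:
        """Return the worst health found in this node or any descendant."""
        states = [self.health] + [c.rolled_up_health() for c in self.children]
        return max(states, key=SEVERITY.__getitem__)
```

Under this rule, a critical database at the bottom of the hierarchy marks the business service at the top as critical, which ties resource health to business impact.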

Tenancy

Tenancy divides the enterprise into independent management domains, called tenants, where each tenant is a logical container of managed resources. Dashboards, management policies, and integrations are scoped to a tenant.

The tenancy model defines two core constructs:

  • A partner is a master tenant and is associated with your account.
  • A client is a partner sub-tenant. Different management policies can be applied to different clients.

Partners and clients can each have separate sets of user accounts, and a user account can be part of one and only one partner or client tenant.

User privileges within a tenant can be specified using the following RBAC criteria:

User: An account within a tenant.
User Group: A group of users.
Permission: Authorization controls limiting user access and activities.
Role: An association of a user or user group with permissions against managed resources. A user or user group can be permitted specific actions on specific resources.
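
The relationship between these four criteria can be sketched as a small data model. All class, field, and function names are hypothetical, and the sketch assumes that user names and group names do not collide. It is not the platform's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Role:
    name: str
    # Maps a resource type to the set of actions permitted on it.
    permissions: Dict[str, Set[str]] = field(default_factory=dict)


@dataclass
class User:
    name: str
    tenant: str  # exactly one partner or client tenant
    groups: Set[str] = field(default_factory=set)


def is_permitted(user: User, action: str, resource_type: str,
                 assignments: Dict[str, List[Role]]) -> bool:
    """True if a role assigned to the user, or to one of the user's
    groups, permits the action on the resource type."""
    principals = {user.name} | user.groups
    for principal in principals:
        for role in assignments.get(principal, []):
            if action in role.permissions.get(resource_type, set()):
                return True
    return False
```

Here `assignments` maps a user name or group name to its roles, so a user inherits every permission granted to a group the user belongs to.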

Topology maps

A topology map is automatically built from relationships determined during discovery. Each node in a topology map represents a managed resource and an edge between nodes represents the type of connection between those resources. With a topology map, you can visualize and explore your infrastructure, drilling down to an increasingly greater level of detail. Topology maps can also be used to model the impact of planned changes.
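
Modeling the impact of a planned change amounts to a graph traversal. The sketch below assumes the topology is stored as an adjacency list whose edges point from a resource to the resources that depend on it. The resource names and the edge direction are illustrative assumptions.

```python
from collections import deque
from typing import Dict, List, Set

# Assumed representation: resource -> resources connected downstream of it.
Topology = Dict[str, List[str]]


def impacted_by(topology: Topology, changed: str) -> Set[str]:
    """Breadth-first walk of every resource reachable from the changed one."""
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, []):
            if neighbor not in seen and neighbor != changed:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Given a core switch feeding two rack switches, the function returns every rack switch and server that a change to the core switch could affect.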

Component model

The following figure represents the basic system components or building blocks, which cooperate to implement the ITOM feature set:

[Figure: Documentation Information Model]

The arrows indicate a generalized workflow. Resources are discovered and resource management is updated accordingly. Monitoring, using managed-resource information, scans or waits for resource fault/recovery alerts, and forwards any alert condition for alert correlation. The alert is resolved to a context-sensitive event and alert management applies the management logic to remediate the alert condition, using automation to take the appropriate response.

Integrations are provided for the platform layer and, in the solution layer, for discovery and monitoring and for event management. These integrations provide interactivity with external devices and services, extending the functionality, compatibility, and scalability of the core platform.

Platform layer

The platform layer implements the core functionality on which the higher-level functions of the solution layer are built. In general, configuration and policy are set in the platform layer and govern the operation of the solution layer.

The following integrations are provided to support platform functionality:

  • Password Management
  • SSO
  • Duo Security
  • Stream exports

The platform layer includes the following elements that support enterprise resource management:

  • resource management
  • dashboards
  • ticketing
  • reporting

In addition, the platform layer supports a multi-tenancy model for managing system accounts and users, which is a key construct for scoping the platform. Tenancy partitions the platform into a hierarchical arrangement of a partner entity with multiple client entities, each client hosting multiple users. Authorizations, permissions, and roles provide complete, user-based management functionality, shown as Users, Groups, and RBAC in the figure.

Agent and gateway components enable distributed operation in a cloud environment. Coresident with the service or network resource they monitor, they aggregate and forward data from the managed devices to the cloud. They can also be configured to run automated scripts and perform other housekeeping functions.

Finally, the API provides a full-featured REST interface to automate management operations at scale. The following figure gives a generalized view of the API:

[Figure: API]

Solution layer

The solution layer, building on the services of the platform layer, implements the functionality needed for ITOM and AIOps. It consists of hybrid discovery and monitoring, event and incident management, and remediation and automation.

Integrations are provided to support the functionality of each of these areas:

  • discovery and monitoring integrations:

    • Public cloud
    • Cloud native
    • Compute
    • Data exports
    • Network
    • Storage
  • event management integrations:

    • Collaboration
    • Configuration automation
    • Custom integration
    • Patch management
    • Third-party events
    • Ticketing and ITSM

Hybrid discovery and monitoring

A broad range of IT resources across datacenter, public cloud, and cloud native environments can be discovered and monitored with agent-based and agentless monitors. These include:

  • Datacenter applications, URLs, containers, servers, and network resources.
  • Public cloud environments of compute instances, databases, load balancers, and PaaS services.
  • Cloud native environments with containers and orchestrators.

Built-in monitors are provided that capture availability and performance metrics and observe optimal threshold limits for supported resources. You can extend the platform to monitor any kind of IT resource by writing custom monitor scripts.

Event and incident management

Events represent business-impacting issues that require a response. Event and incident management uses escalation policies to aggregate, interpret, and act on events detected by monitors, resource diagnostics, and third-party integrations.

Using service maps, you can visualize the relationship between monitored resources and assess business and user impact based on resource health.

Event interpretation and response can be automated. Automation correlates and suppresses alerts, notifies users, and creates incident tickets for alerts that need operator intervention.

Remediation and automation

Event remediation can also be automated by composing workflows to handle events. This includes SMS, voice, and email notification. Remote SSH is also supported for alert resolution.

Process model

The process model describes how platform elements interoperate to realize full ITOM functionality. The following figure shows the generalized workflow sequence and functional areas involved, from account setup to interactive resource management:

[Figure: Tenancy Model]

Strategically, the process model flows from design or planning activities to implementation activities. Design activities are shown in step 1 of the diagram but also include the planning activities of step 2, which involve an understanding of best practices and your particular infrastructure requirements. Implementation activities are shown in the hands-on, configuration work of steps 2 and 3. During design, you assess organizational capabilities, assets, and requirements and create a management environment that meets business needs. You then implement policies and processes that satisfy the design criteria.

Design activities result in partitioning, populating, and configuring the platform according to requirements for supporting the managed resources. Design elements include:

  • Partitioning your multi-tenant environment into client entities that match your organizational structure.
  • Defining users and assigning permissions and access-level roles to control system access.
  • Specifying the kind of instrumentation needed, depending on the resources that need to be managed.
  • Grouping resources so they can be managed as a class instead of individually.

Implementation activities that realize design goals include:

  • Defining management policies that automate actions performed on the resource when the resource is discovered.
  • Collecting the credentials needed to discover and monitor resources.
  • Finding and onboarding managed resources, employing the available integrations.
  • Validating that resources are successfully onboarded, using dashboards and reports.

In operation, discovered resources are monitored according to defined resource management policies. The following figure gives a more detailed view of the platform and solution components involved in the operational workflow:

[Figure: E&R Workflow]

When an alerting event occurs, one that satisfies the management policy criteria, it is handled by the event management and remediation subsystem. The following figure shows a more detailed representation of event management and remediation components and workflow:

[Figure: E&R Workflow]

Event management involves the following actions, depending on context:

  • Event aggregation using:

    • monitoring templates that generate alerts.
    • a third-party monitoring integration.
  • Event de-duplication.

  • Event suppression for unwanted alerts.

  • Event correlation, which correlates similarity-based events and co-occurrence-based events, using machine learning.

The following figure shows how templates are applied to managed resource metric data to classify and handle alerts:

[Figure: Templates]

Monitoring templates are specific to the monitored resource.

Event remediation involves notification using the reporting and ticketing systems. The process definition feature provides for automated event remediation, with or without operator intervention.

Deployment model

This section describes the platform deployment models for account management, tenancy, and the deployment of entities in a distributed cloud-agent-gateway environment.

Tenancy

Tenancy is a key construct for managing the platform itself. The following figure represents the elements of tenancy:

[Figure: Tenancy Model]

The partner account partitions resources and capabilities among clients, which are typically business units. Partners and clients introduce the notion of scope for users, who are granted access rights within the partner or client scope.

The following figure shows a further iteration on user access rights:

[Figure: RBAC]

Users can be grouped into user groups. Users and user groups can be assigned roles, which have certain access rights to resources depending on role requirements. Access level is defined by the permission set defined for the resource, which can be a subset of:

  • Manage
  • Create
  • View
  • Edit

The following example shows user groups assigned to either a Server Administrator or Network Administrator role with different resource access permissions, depending on the role:

[Figure: RBAC Example]
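
A permission set that is a subset of Manage, Create, View, and Edit maps naturally onto bit flags. In the sketch below, the role names come from the example above, but the specific permissions granted to each role are invented for illustration and do not reflect the figure.

```python
from enum import Flag, auto


class Permission(Flag):
    VIEW = auto()
    CREATE = auto()
    EDIT = auto()
    MANAGE = auto()


# Hypothetical grants: role -> resource type -> permission subset.
ROLE_PERMISSIONS = {
    "Server Administrator": {
        "server": Permission.VIEW | Permission.EDIT | Permission.MANAGE,
    },
    "Network Administrator": {
        "network": Permission.VIEW | Permission.CREATE | Permission.EDIT,
    },
}


def allows(role: str, resource_type: str, needed: Permission) -> bool:
    """True if the role grants every needed permission on the resource type."""
    granted = ROLE_PERMISSIONS.get(role, {}).get(resource_type, Permission(0))
    return (needed & granted) == needed
```

A role that has no entry for a resource type is treated as having no permissions on it, so a Server Administrator cannot view network resources in this sketch.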

Hybrid cloud and distributed architecture

A cloud environment involves the deployment of agents and gateways in the proximity of managed resources, or integrations with other domains for direct interaction with the cloud:

[Figure: Deployment Model]

The figure shows servers communicating with an agent, which interfaces with a gateway to access the cloud. Network and other non-server resources communicate directly with the gateway to interface with the cloud. Synthetics and public cloud integrations also provide a cloud interface in addition to APIs.

Agents and gateways can be deployed independently or in tandem to provide connectivity between managed resources and the cloud.

Agent

The agent is an executable application that runs on managed Windows and Linux devices, such as servers, desktops, and laptops. Agents can communicate with the cloud directly, through a gateway, or through a customer-owned proxy server.

Agent-based deployment is on:

  • Physical and private cloud servers

    • Windows
    • Linux
  • Public cloud instances

    • Windows
    • Linux
  • Kubernetes instances

The following installation methods are provided, depending on operating system:

  • Windows

    • Deployment using a provided utility
    • Deployment using group policy
    • Deployment using a customer-owned automation tool, such as Chef, Puppet, Ansible, or any other orchestration tool
  • Linux

    • A feature that installs agents on Linux from the cloud
    • Installation using a Linux deployment script
    • Deployment through a customer-owned automation tool, such as Chef, Puppet, Ansible, or any other orchestration tool

Gateway

A gateway is a virtual machine that manages network devices, such as switches, routers, firewalls, load balancers, and appliances; storage devices; virtual environments; and applications, such as WebLogic and WebSphere:

[Figure: Deployment Model]

The following gateway deployment models are supported:

  • OVA mode for VMware environments
  • ISO mode for hybrid IT models
  • AMI for deployment in the AWS cloud
  • VHD for deployment in the Azure cloud
  • IMAGE for deployment in the Google Cloud Platform