This guide provides a step-by-step process to diagnose and resolve high memory usage that causes NextGen Gateway pods to crash in a Kubernetes environment. It includes commands to check pod status, identify memory-related issues, and implement solutions to stabilize the pod.

Verify Memory Usage if the Pod Crashes Due to a Memory Issue

To verify memory usage in Kubernetes pods, make sure that the metrics server is enabled in the Kubernetes cluster. The kubectl top command retrieves a snapshot of resource utilization for pods or nodes in the cluster.
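If you are not sure whether the metrics server is installed, you can check for it first. This is a quick sketch that assumes the standard metrics-server Deployment in the kube-system namespace; adjust the name or namespace if your cluster differs.

    $ kubectl get deployment metrics-server -n kube-system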

  • Use the following command to verify pod memory usage.

    $ kubectl top pods
    NAME           CPU(cores)   MEMORY(bytes)  
    nextgen-gw-0   48m          1375Mi

  • Use the following command to verify node memory usage.

    $ kubectl top nodes
    NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%  
    nextgen-gateway   189m         9%     3969Mi          49%
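If the pod runs multiple containers, a per-container breakdown helps identify which one (for example, vprobe) is consuming the memory. The command below assumes the pod name shown above:

    $ kubectl top pod nextgen-gw-0 --containers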

NextGen Gateway Pod Crashes Due to High Memory Usage

The NextGen Gateway pod in a Kubernetes cluster crashes due to high memory usage.

Possible Causes

When a pod exceeds its allocated memory, the Kubernetes system automatically kills the process to protect the node’s stability, resulting in an “OOMKilled” (Out of Memory Killed) error. This is particularly critical for the NextGen Gateway, as it may affect the stability and monitoring capabilities of the OpsRamp platform.
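The allocation that triggers the OOM kill is the container-level memory limit in the pod specification. The snippet below is only an illustrative sketch; the actual container names and values for the NextGen Gateway are defined by its deployment manifests or Helm chart.

    containers:
      - name: vprobe            # example container from the gateway pod
        resources:
          requests:
            memory: "1Gi"       # amount used for scheduling (illustrative value)
          limits:
            memory: "2Gi"       # container is OOMKilled if usage exceeds this limit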

Troubleshooting Steps

Follow these steps to diagnose and fix memory issues for the NextGen Gateway pod:

  1. Check the status of the Kubernetes objects to determine whether the pods are running, as shown in the example below.
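    For example, listing the pods shows whether the gateway pod is restarting; a pod that keeps getting OOMKilled typically reports a CrashLoopBackOff status and an increasing restart count:
    kubectl get pods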
  2. Use the following command to gather detailed information about the pod. This will include the status, restart count, and the reason for any previous restarts.
    kubectl describe pod <pod_name>
    For example:
    kubectl describe pod nextgen-gw-0
  3. Look for memory-related termination reasons in the pod’s container status and events.

    Sample output of the describe command:
    vprobe:
        Container ID:   containerd://40c8585cf88dc7d0dd4e43560dc631ef559b0c92e6d5d429719a384aaea77777
        Image:          us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe:17.0.0
        Image ID:       us-central1-docker.pkg.dev/opsramp-registry/gateway-cluster-images/vprobe@sha256:8de1a98c3c14307fa4882c7e7422a1a4e4d507d2bbc454b53f905062b665e9d2
        Port:           <none>
        Host Port:      <none>
        State:          Running
          Started:      Mon, 29 Jan 2024 12:01:30 +0530
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Mon, 29 Jan 2024 12:00:42 +0530
          Finished:     Mon, 29 Jan 2024 12:01:29 +0530
        Ready:          True
        Restart Count:  1
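    To jump straight to the termination details in a long describe output, you can filter it. This is a convenience sketch assuming the pod name nextgen-gw-0 and a shell with grep available:
    kubectl describe pod nextgen-gw-0 | grep -A 4 "Last State"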
  4. Confirm the memory issue by the exit code.
    • If the exit code is 137, the pod is crashing due to a memory issue (OOMKilled).
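    You can also read the exit code directly from the pod’s container statuses. This sketch assumes the pod name nextgen-gw-0; containers that have not restarted print nothing:
    kubectl get pod nextgen-gw-0 -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'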
  5. Fix the memory issue:
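    One common way to address repeated OOM kills is to give the affected container more memory. The command below is only a minimal sketch: it assumes the gateway runs as a StatefulSet named nextgen-gw, that vprobe is the container being killed, and that 2Gi is an appropriate limit for your environment. If the gateway is installed through a Helm chart or an operator, update the resource values through that tooling instead so the change is not reverted.
    kubectl set resources statefulset nextgen-gw -c vprobe --limits=memory=2Gi
    With the default rolling update strategy, the pod is recreated with the new limit; rerun kubectl top pods afterward to confirm that memory usage stays below it.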