

Exit Code 137 – Fixing OOMKilled Kubernetes Error

In this article, we will look at the common OOMKilled error in Kubernetes, also denoted by exit code 137, and learn what it means and what commonly causes it. More importantly, we will learn how to fix it!

We will cover:

  1. What is OOMKilled Kubernetes Error (exit code 137)?
  2. How does the OOMKiller mechanism work?
  3. OOMKilled (exit code 137) diagnosis
  4. How to fix exit code 137?

What is OOMKilled Kubernetes Error (Exit Code 137)?

When a container in a Kubernetes cluster exceeds its memory limit, the Linux kernel terminates it and Kubernetes reports the reason as “OOMKilled”, which indicates that the process was killed due to an out-of-memory condition. The exit code for this error is 137, which follows the convention of 128 plus the signal number: the process was killed by signal 9 (SIGKILL). If you hadn’t already guessed, OOM stands for ‘out-of-memory’!
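To make the 128 + signal convention concrete, here is a minimal Python sketch (the function names are hypothetical, and it assumes a Linux host) that decodes an exit code and reproduces 137 by having a child process send SIGKILL to itself, the same signal the OOMKiller uses. Keep in mind that 137 on its own only tells you the process received SIGKILL; the OOMKilled reason reported by Kubernetes is what ties it to memory.

```python
import signal
import subprocess
import sys


def describe_exit_code(code: int) -> str:
    """Interpret an exit code using the shell convention of 128 + signal number."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            # Not a valid signal number, so treat it as an ordinary exit status.
            return f"exited with status {code}"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


def run_and_get_exit_code(args: list[str]) -> int:
    """Run a command and report its exit code the way a shell or Kubernetes would."""
    returncode = subprocess.run(args).returncode
    # On POSIX, subprocess reports "killed by signal N" as -N; convert that to 128 + N.
    return 128 - returncode if returncode < 0 else returncode


if __name__ == "__main__":
    # A child that sends itself SIGKILL, the same signal the OOMKiller uses.
    child = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    code = run_and_get_exit_code(child)
    print(code, describe_exit_code(code))  # prints: 137 killed by SIGKILL (signal 9)
```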

The STATUS column for your pods will show OOMKilled if they encounter the error (or CrashLoopBackOff if the container keeps being restarted and killed). You can view it using the command:

kubectl get pods

How Does the OOMKiller Mechanism Work?

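The Out-Of-Memory Killer (OOMKiller) is a mechanism in the Linux kernel (not a native Kubernetes feature) that prevents a system from running out of memory by killing processes that consume too much of it. When the system runs out of memory, the kernel invokes the OOMKiller to choose a process to kill, freeing up memory and keeping the system running. The same mechanism also enforces container memory limits: when a container exceeds the limit set on its memory cgroup, the kernel kills a process in that container even if the node as a whole still has free memory.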

The OOMKiller chooses its victim by calculating a “badness” score for each process. In modern kernels, this score is based mainly on how much memory the process uses (its resident memory, swap usage, and page tables) relative to the memory available, adjusted by a per-process value called oom_score_adj.

Once the OOMKiller selects a process, it sends it a SIGKILL signal. SIGKILL cannot be caught or ignored, so the process is terminated immediately, without a chance to shut down gracefully, and its memory is freed.

Note: A container that is OOMKilled is not necessarily evicted from the node. If the pod’s restartPolicy is set to “Always” (the default), the kubelet restarts the container instead.

The OOMKiller is a last-resort mechanism that is only invoked when the system is in danger of running out of memory. While it can help to prevent a system from crashing due to memory exhaustion, it is important to note that killing processes can result in data loss and system instability. As such, it is recommended to configure your system to avoid OOM situations, for example, by monitoring memory usage, setting resource limits, and optimizing memory usage in your applications.

Under the hood, the Linux kernel maintains an oom_score for each process running on the host. The higher the score, the more likely the process is to be killed when memory runs out.

The oom_score_adj value, which ranges from -1000 to 1000, adjusts a process’s oom_score to make it more or less likely to be killed; a value of -1000 effectively exempts the process from the OOMKiller. Kubernetes sets oom_score_adj for each container based on its pod’s Quality of Service (QoS) class.

There are three QoS classes that can be assigned to a pod, each with a matching value for oom_score_adj:

  • Guaranteed: -997
  • BestEffort: 1000
  • Burstable: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
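The three rules can be restated as a small Python function. This is a sketch that only mirrors the formulas listed above, using integer division; the function name and the 16 GiB node in the usage line are hypothetical, and the kubelet’s exact calculation may vary between Kubernetes versions.

```python
GIB = 1024 ** 3


def oom_score_adj(qos_class: str, memory_request_bytes: int = 0,
                  machine_memory_capacity_bytes: int = 0) -> int:
    """Return the oom_score_adj value for a container, following the QoS rules."""
    if qos_class == "Guaranteed":
        return -997
    if qos_class == "BestEffort":
        return 1000
    if qos_class == "Burstable":
        if machine_memory_capacity_bytes <= 0:
            raise ValueError("Burstable containers need the node's memory capacity")
        # Larger memory requests (relative to the node) give a lower, safer score.
        adj = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
        # Clamp so Burstable always lands between Guaranteed and BestEffort.
        return min(max(2, adj), 999)
    raise ValueError(f"unknown QoS class: {qos_class!r}")


# Hypothetical example: a Burstable container requesting 4 GiB on a 16 GiB node.
print(oom_score_adj("Burstable", 4 * GIB, 16 * GIB))  # 750
```

The clamp means even a Burstable container that requests almost nothing scores 999, just below BestEffort, and one that requests the whole node still scores 2, just above Guaranteed.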

Because pods in the Guaranteed QoS class have the lowest oom_score_adj value (-997), they are the last to be killed on a node that is running out of memory. BestEffort pods are the first to be killed, as they have the highest value (1000).

To see the QoS class of a pod, run the following command:

kubectl get pod <podname> -o jsonpath='{.status.qosClass}'
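To check every pod in a namespace at once, you can parse the JSON that kubectl get pods -o json returns. Below is a minimal Python sketch, assuming kubectl is installed and configured for your cluster; the helper names are hypothetical.

```python
import json
import subprocess


def fetch_pod_list(namespace: str = "default") -> dict:
    """Run `kubectl get pods -o json` and parse the result (needs kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def qos_classes(pod_list: dict) -> dict[str, str]:
    """Map each pod name to its QoS class (status.qosClass), or 'Unknown' if not yet set."""
    return {
        item["metadata"]["name"]: item.get("status", {}).get("qosClass", "Unknown")
        for item in pod_list.get("items", [])
    }
```

For example, qos_classes(fetch_pod_list("default")) returns a dictionary mapping each pod name in the default namespace to its QoS class.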

Run kubectl exec -it <podname> -- /bin/bash to open a shell in the pod (use /bin/sh if the image does not include bash).

To see the oom_score, run cat /proc//oom_score; to see the oom_score_adj, run cat /proc//oom_score_adj.

Read more about the kubectl exec command.
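If a Python interpreter happens to be available inside the container (an assumption; many images do not ship one), the same values can be read programmatically. This sketch reads them from procfs; the function name and the proc_root parameter are hypothetical, the latter added only so the function can be tested against a fake directory.

```python
from pathlib import Path


def read_oom_values(pid="self", proc_root: str = "/proc") -> tuple[int, int]:
    """Return (oom_score, oom_score_adj) for a process, read from procfs."""
    base = Path(proc_root) / str(pid)
    oom_score = int((base / "oom_score").read_text().strip())
    oom_score_adj = int((base / "oom_score_adj").read_text().strip())
    return oom_score, oom_score_adj
```

Calling read_oom_values(1) reads the values for PID 1, which is usually the container’s main process, while the default "self" reads those of the Python process itself.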

OOMKilled (Exit Code 137) Diagnosis

Step 1: Check the pod status and logs

The first step in diagnosing an OOMKilled error is to confirm that the container really was killed for using too much memory. Run kubectl describe pod and look at the container’s Last State: a Reason of OOMKilled and an Exit Code of 137 confirm the problem, and the Events section gives further confirmation and shows when it occurred.

kubectl describe pod <podname>

State:          Running
  Started:      Fri, 12 May 2023 11:14:13 +0200
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

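You can also check the logs of the previous, terminated instance of the container, which may show what the application was doing just before it was killed:

kubectl logs <podname> --previous

On the node itself, the kubelet keeps container log files in a directory for each pod under /var/log/pods/.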
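The information that kubectl describe prints is also available in the pod’s JSON (the state and lastState fields of status.containerStatuses), which is handy when you want to scan for OOM kills in a script. Below is a minimal Python sketch, assuming kubectl access; the helper names are hypothetical.

```python
import json
import subprocess


def fetch_pod(name: str, namespace: str = "default") -> dict:
    """Run `kubectl get pod <name> -o json` and parse the result."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def oom_kills(pod: dict) -> list[dict]:
    """List containers whose current or last termination reason is OOMKilled."""
    found = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # "state" is the current state; "lastState" is the previous one (after a restart).
        for field in ("state", "lastState"):
            terminated = status.get(field, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                found.append({
                    "container": status["name"],
                    "which": field,
                    "exitCode": terminated.get("exitCode"),
                    "finishedAt": terminated.get("finishedAt"),
                    "restartCount": status.get("restartCount", 0),
                })
    return found
```

For example, oom_kills(fetch_pod("<podname>")) returns an empty list when none of the pod’s containers were OOMKilled.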

Step 2: Monitor memory usage

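Use monitoring tools such as Prometheus and Grafana to track memory usage in pods and containers over time. This can help you identify which containers are consuming too much memory and triggering the OOMKilled error. For a quick point-in-time check, kubectl top pod <podname> --containers shows current memory usage per container, provided the Metrics Server is installed in the cluster.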
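Memory values from kubectl top or a pod manifest are Kubernetes quantities such as 950Mi or 1Gi. This Python sketch converts them to bytes so you can see how close a container is to its limit. It handles only plain numbers with the common binary (Ki, Mi, Gi, ...) and decimal (k, M, G, ...) suffixes, an assumption that skips exponent notation and milli units; the function names are hypothetical.

```python
import re
from decimal import Decimal

# Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes used in Kubernetes quantities.
_MULTIPLIERS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '950Mi' or '1.5Gi' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _MULTIPLIERS:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    # Decimal keeps fractional quantities like 1.5Gi exact before truncating to bytes.
    return int(Decimal(number) * _MULTIPLIERS[suffix])


def memory_utilization(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit, e.g. 0.93 for 93%."""
    return parse_memory(usage) / parse_memory(limit)
```

For example, memory_utilization("950Mi", "1Gi") is about 0.93, meaning the container is at roughly 93% of its limit and a small spike could trigger an OOM kill.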

Step 3: Use a memory profiler

Use a memory profiler, such as pprof for Go applications, to identify memory leaks or inefficient code that may be causing excessive memory usage.
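pprof is specific to Go. For a Python application, the standard library’s tracemalloc module plays a similar role. The sketch below is a swapped-in example, not a pprof workflow: it runs a workload under tracemalloc and reports which source lines hold the most memory. The leaky_handler function is a hypothetical stand-in for real application code.

```python
import tracemalloc


def top_allocations(workload, limit: int = 3) -> list[tuple[str, int]]:
    """Run workload under tracemalloc and return the source lines holding the most memory."""
    tracemalloc.start()
    try:
        workload()
        # Only allocations made while tracing, and still alive, appear in the snapshot.
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    stats = snapshot.statistics("lineno")
    return [(str(stat.traceback[0]), stat.size) for stat in stats[:limit]]


# Hypothetical leaky handler: it keeps appending buffers to a module-level list.
_cache = []


def leaky_handler() -> None:
    for _ in range(1000):
        _cache.append(bytearray(10_000))


for location, size in top_allocations(leaky_handler):
    print(f"{location}: {size / 1024:.0f} KiB")
```

The line inside leaky_handler that appends to _cache should appear at the top of the report, pointing straight at the memory that is never released.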

How to fix Exit Code 137?

Below are the common causes of the OOMKilled Kubernetes error and their resolutions.

  1. The container memory limit was reached.

This could be because the memory limit in the container manifest, which is the maximum amount of memory the container is allowed to use, is set too low for the application. It could also be because the application is under a higher load than normal.

The resolution is either to increase the memory limit or to investigate the root cause of the increased load and remediate it. Common causes of increased load include large file uploads, which can consume a lot of memory, especially when multiple containers are running within a pod, and sudden spikes in traffic.
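When raising the limit, it helps to base the new value on observed usage rather than guessing. The sketch below builds the memory part of a container’s resources block from a typical and a peak usage figure. The 25% headroom default and the function name are assumptions for illustration, not a Kubernetes recommendation. It also enforces the rule that a memory request may not exceed the limit, which the Kubernetes API would reject.

```python
import math

MI = 2 ** 20  # bytes in one mebibyte (Mi)


def memory_resources(typical_bytes: int, peak_bytes: int, headroom: float = 1.25) -> dict:
    """Suggest memory requests/limits (in Mi) from observed typical and peak usage."""
    if headroom < 1:
        raise ValueError("headroom must be at least 1.0")
    request_mi = math.ceil(typical_bytes / MI)
    # Leave headroom above the observed peak so normal spikes do not hit the limit.
    limit_mi = math.ceil(peak_bytes * headroom / MI)
    if request_mi > limit_mi:
        raise ValueError("memory request must not exceed the memory limit")
    return {"requests": {"memory": f"{request_mi}Mi"}, "limits": {"memory": f"{limit_mi}Mi"}}
```

For example, memory_resources(512 * MI, 800 * MI) suggests a request of 512Mi and a limit of 1000Mi.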

  2. The container memory limit was reached because the application has a memory leak.

The application needs to be debugged to find and fix the source of the leak.

  3. The node is overcommitted.

This means the total memory used by pods is greater than the total node memory available. Increase the memory available to the node by scaling up, or move the pods to a node with more memory available.

You could also adjust the memory limits of the pods running on the overcommitted node so that they fit within the available memory. Pay attention to the memory requests setting as well: it specifies the amount of memory the scheduler reserves for a pod on a node, so if it is set too high, reserved memory goes unused, which is an inefficient use of the node’s capacity.
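kubectl describe node <nodename> shows a node’s Allocatable memory and the total requests and limits of the pods scheduled on it. The following Python sketch does the same arithmetic. It treats the node as overcommitted when the sum of memory limits exceeds allocatable memory or when any pod has no limit at all, which is one reasonable definition rather than an official one; the function name and pod figures are hypothetical.

```python
from typing import Optional

GI = 2 ** 30  # bytes in one gibibyte (Gi)


def overcommit_report(allocatable_bytes: int,
                      pods: list[tuple[str, int, Optional[int]]]) -> dict:
    """Compare summed pod memory requests/limits with a node's allocatable memory.

    Each pod is (name, request_bytes, limit_bytes), with limit_bytes None if unset.
    """
    total_requests = sum(request for _, request, _ in pods)
    total_limits = sum(limit for _, _, limit in pods if limit is not None)
    no_limit = [name for name, _, limit in pods if limit is None]
    return {
        "requests_pct": round(100 * total_requests / allocatable_bytes, 1),
        "limits_pct": round(100 * total_limits / allocatable_bytes, 1),
        "pods_without_limit": no_limit,
        # If limits add up to more than the node has, pods can collectively
        # exhaust node memory while each stays within its own limit.
        "overcommitted": total_limits > allocatable_bytes or bool(no_limit),
    }
```

For example, on a node with 8 GiB allocatable, three pods with limits of 4 GiB, 4 GiB, and 2 GiB give a limits_pct of 125.0, so the node is overcommitted even though their requests of 2 GiB, 2 GiB, and 1 GiB add up to only 62.5%.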

When adjusting memory requests and limits, keep in mind how Kubernetes chooses which pods to remove when a node runs low on memory. The kubelet ranks pods for eviction by:

    • Whether the pod’s memory usage exceeds its memory request.
    • The pod’s priority.
    • How far the pod’s memory usage exceeds its memory request.

If the kernel’s OOMKiller acts before the kubelet can evict anything, it uses the QoS-based oom_score_adj values described earlier, so BestEffort pods (no requests or limits) are killed first, Burstable pods next, and Guaranteed pods last.

Also check out how to fix CreateContainerConfigError in Kubernetes.

Key Points

To avoid the OOMKilled error, it is recommended to monitor memory usage in Kubernetes pods and containers, set resource limits to prevent containers from consuming too much memory, and optimize application code to reduce memory consumption.

Additionally, consider increasing the memory resources allocated to the pod or using horizontal pod autoscaling to scale up the number of pods in response to increased workload demands.
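The Horizontal Pod Autoscaler computes the desired number of replicas as ceil(currentReplicas * currentMetricValue / desiredMetricValue), and skips scaling when that ratio is within a tolerance of 1.0 (0.1 by default). Below is a minimal Python sketch of just that calculation; it ignores minReplicas/maxReplicas bounds and stabilization windows, and the function name is hypothetical. Adding replicas only helps when the memory pressure comes from load that can be spread across pods; it will not fix a leak.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Apply the HPA scaling formula to a single metric (e.g. average memory utilization)."""
    ratio = current_value / target_value
    # Within the tolerance band, the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

For example, with 3 replicas averaging 90% memory utilization against a 60% target, desired_replicas(3, 90, 60) returns 5.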

We encourage you to also check out how Spacelift helps you manage the complexities and compliance challenges of using Kubernetes. Anything that can be run via kubectl can be run within a Spacelift stack. Find out more about how Spacelift works with Kubernetes, and get started on your journey by creating a free trial account.
