Exit Code 137 – Fixing OOMKilled Kubernetes Error

In this article, we will look at the common OOMKilled error in Kubernetes, also denoted by exit code 137: what it means, what commonly causes it, and, more importantly, how to fix it!

We will cover:

  1. What is OOMKilled Kubernetes Error (exit code 137)?
  2. How does the OOMKiller mechanism work?
  3. Exit Code 137 common causes
  4. OOMKilled (exit code 137) diagnosis
  5. How to fix exit code 137?
  6. Preventing OOMKilled errors

What is OOMKilled Kubernetes Error (Exit Code 137)?

A 137 exit code in Kubernetes indicates that a process was forcibly terminated. In Unix and Linux systems, when a process is terminated due to a signal, the exit code is determined by adding 128 to the signal number. In this case, the signal number is 9, which means “SIGKILL”, so by adding 9 to 128, you get the 137 exit code.
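
The arithmetic can be checked in a local shell, without a cluster. The sketch below is a minimal illustration, assuming a shell such as bash or dash that reports 128 plus the signal number for a signal-terminated child: it starts a throwaway child shell that sends SIGKILL to itself and then reads the exit status seen by the parent.

```shell
# Start a throwaway child shell that sends SIGKILL (signal 9) to itself.
# The single quotes make $$ expand inside the child, so it is the child's PID.
status=0
sh -c 'kill -9 $$' || status=$?

# The parent shell reports 128 + signal number for a signal-terminated child.
expected=$((128 + 9))
echo "exit status: $status (expected: $expected)"
```

The shell may also print a "Killed" notice on standard error; that message comes from the parent shell, not from the killed process.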

When a container in a Kubernetes cluster exceeds its memory limit, the Linux kernel kills it and Kubernetes reports the container as “OOMKilled”, which indicates that the process was killed due to an out-of-memory condition. The exit code for this error is 137. If you hadn’t already guessed, OOM stands for ‘out-of-memory’!

The Status of your pods will show ‘OOMKilled’ if they encounter the error, which you can view using the command:

kubectl get pods

How Does the OOMKiller Mechanism Work?

The Out-Of-Memory Killer (OOMKiller) is a mechanism in the Linux kernel (not native Kubernetes) that is responsible for preventing a system from running out of memory by killing processes that consume too much memory. When the system runs out of memory, the kernel invokes the OOMKiller to choose a process to kill in order to free up memory and keep the system running.

The OOMKiller works by selecting the process whose termination frees the most memory at the least cost to the system. The selection is based primarily on how much memory the process is using, adjusted by a per-process tuning value (oom_score_adj, described below).

Once the OOMKiller selects a process to kill, it sends the process a SIGKILL signal. SIGKILL cannot be caught or ignored, so the process gets no chance to shut down gracefully; the kernel terminates it immediately and frees up its memory.

Note: A pod whose container is killed due to a memory issue is not necessarily evicted from the node. If the pod’s restart policy is set to “Always”, the kubelet will instead try to restart the container.

The OOMKiller is a last-resort mechanism that is only invoked when the system is in danger of running out of memory. While it can help to prevent a system from crashing due to memory exhaustion, it is important to note that killing processes can result in data loss and system instability. As such, it is recommended to configure your system to avoid OOM situations, for example, by monitoring memory usage, setting resource limits, and optimizing memory usage in your applications.

Going under the hood, the Linux kernel maintains an oom_score for each process running on the host. The higher the score, the more likely the process is to be killed.

An oom_score_adj value allows users to customize the OOM process and influence which processes are terminated first. Kubernetes sets the oom_score_adj value for a pod’s containers according to the pod’s Quality of Service (QoS) class.

There are three QoS classes that can be assigned to a pod, each with a matching value for oom_score_adj:

  • Guaranteed: -997
  • BestEffort: 1000
  • Burstable: min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Because pods with the QoS class of Guaranteed have the lowest value of -997, they are the last to be killed on a node that is running out of memory. BestEffort pods are the first to be killed as they have the highest value of 1000.
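
To get a feel for where a Burstable pod lands between those two extremes, the formula from the list above can be evaluated with shell integer arithmetic. This is a sketch for illustration only; the function name and the example sizes are assumptions, not part of Kubernetes.

```shell
# Compute the Burstable oom_score_adj from the formula in the list above.
# Arguments: memory request in bytes, node memory capacity in bytes.
burstable_oom_score_adj() {
  request_bytes=$1
  capacity_bytes=$2
  adj=$(( 1000 - (1000 * request_bytes) / capacity_bytes ))
  # Clamp to the range [2, 999], as min(max(2, ...), 999) does.
  if [ "$adj" -lt 2 ]; then adj=2; fi
  if [ "$adj" -gt 999 ]; then adj=999; fi
  echo "$adj"
}

gib=$((1024 * 1024 * 1024))
burstable_oom_score_adj $((1 * gib)) $((4 * gib))   # prints 750
burstable_oom_score_adj $((4 * gib)) $((4 * gib))   # prints 2
burstable_oom_score_adj 0 $((4 * gib))              # prints 999
```

The larger the memory request relative to node capacity, the lower the value, so a Burstable pod that requests more memory is less likely to be chosen than one that requests very little.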

To see the QoS class of a pod, run the following command:

kubectl get pod <podname> -o jsonpath='{.status.qosClass}'

Run kubectl exec -it <podname> -- /bin/bash to connect to the pod.

To see the oom_score, run cat /proc/<pid>/oom_score, and to see the oom_score_adj, run cat /proc/<pid>/oom_score_adj, where <pid> is the ID of the process you want to inspect.
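
These files exist for every process on a Linux host, so the lookup can be tried outside a cluster first. The sketch below reads the values for the current shell process. Inside a container the main process is often PID 1, but that is an assumption about how the image starts its entrypoint.

```shell
# Read the OOM score and adjustment of the current shell process.
# /proc is Linux-specific, so fall back to a placeholder elsewhere.
pid=$$
if [ -r "/proc/$pid/oom_score" ]; then
  score=$(cat "/proc/$pid/oom_score")
  score_adj=$(cat "/proc/$pid/oom_score_adj")
else
  score="unavailable"
  score_adj="unavailable"
fi
echo "pid=$pid oom_score=$score oom_score_adj=$score_adj"
```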

Read more about the kubectl exec command.

Exit Code 137 Common Causes

There are usually two causes that bring up a 137 exit code:

  • The first and most common one is related to resource limits. In this case, a container exceeds its allocated memory limit, and when that happens, it is terminated to ensure the stability of the node.
  • The other case is related to manual intervention: a user or script might send a ‘SIGKILL’ signal to a container process, leading to this exit code.

OOMKilled (Exit Code 137) Diagnosis

Step 1: Check the pod logs

The first step in diagnosing an OOMKilled error is to check the pod’s status and logs for any indication of a memory issue. The output of the describe command will give further confirmation and the time/date the error occurred.

kubectl describe pod <podname>
State:          Running
       Started:      Fri, 12 May 2023 11:14:13 +0200
       Last State:   Terminated
       Reason:       OOMKilled
       Exit Code:    137
       ...
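
When checking many pods, the same confirmation can be scripted. The helper below is a hypothetical sketch (the function name was_oom_killed is not a kubectl feature): it reads describe output on standard input and succeeds only if both the OOMKilled reason and exit code 137 appear.

```shell
# Succeeds (exit status 0) when `kubectl describe pod` output read from
# stdin contains both the OOMKilled reason and exit code 137.
was_oom_killed() {
  describe_output=$(cat)
  printf '%s\n' "$describe_output" | grep -Eq 'Reason:[[:space:]]+OOMKilled' &&
    printf '%s\n' "$describe_output" | grep -Eq 'Exit Code:[[:space:]]+137'
}

# Sample input copied from the output shown above.
sample='Last State:   Terminated
Reason:       OOMKilled
Exit Code:    137'

if printf '%s\n' "$sample" | was_oom_killed; then
  echo "container was OOMKilled"
fi
```

Against a live cluster it would be used as kubectl describe pod <podname> | was_oom_killed.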

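You can also interrogate the logs of the previous (terminated) container instance:

kubectl logs <podname> --previous

On the node itself, the kubelet keeps pod log files under /var/log/pods/.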

Step 2: Monitor memory usage

Use Kubernetes monitoring tools such as Prometheus or Grafana to monitor memory usage in pods and containers. This can help you identify which containers are consuming too much memory and triggering the OOMKilled error.

Step 3: Use a memory profiler

Use a memory profiler, such as pprof for Go applications, to identify memory leaks or inefficient code that may be causing excessive memory usage.

How to fix Exit Code 137?

Below are the common causes of the OOMKilled Kubernetes error and their resolutions.

  1. The container memory limit was reached.

This could be due to an inappropriate value for the memory limit specified in the container manifest, which is the maximum amount of memory the container is allowed to use. It could also be due to the application experiencing a higher load than normal.

The resolution would be to increase the value of the memory limit or to investigate the root cause of the increased load and remediate it. Common causes include large file uploads, which can consume a lot of memory, especially when multiple containers are running within a pod, and sudden increases in traffic.

  2. The container memory limit was reached, as the application is experiencing a memory leak.

The application would need to be debugged to resolve the cause of the memory leak.

  3. The node is overcommitted.

This means the total memory used by pods is greater than the total node memory available. Increase the memory available to the node by scaling up, or move the pods to a node with more memory available.

You could also tweak the memory limits for the pods running on the overcommitted nodes so they fit within the available boundaries. Pay attention to the memory requests setting as well, which specifies the amount of memory reserved for a pod when it is scheduled. If this is set too high, it might not be an efficient use of available memory.

When adjusting memory requests and limits, keep in mind that when a node is overcommitted, Kubernetes kills pods according to the following priority order:

    • Pods that do not have requests or limits.
    • Pods that have requests but not limits.
    • Pods that are using more than their memory request value (the minimal memory specified) but under their memory limit.
    • Pods that are using more than their memory limit.
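
For the first cause above, raising the limit means editing the resources section of the container spec. The manifest below is an illustrative sketch: the pod name, image, and the 256Mi and 512Mi values are hypothetical placeholders, not recommendations. It is written to a temporary file from a shell heredoc so it can be reviewed before being applied.

```shell
# Write an example pod manifest with a memory request and limit.
# All names and sizes here are hypothetical placeholders.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: registry.example.com/example-app:1.0
      resources:
        requests:
          memory: "256Mi"   # amount the scheduler reserves on the node
        limits:
          memory: "512Mi"   # exceeding this gets the container OOMKilled
EOF

echo "manifest written to $manifest"
# Apply it with: kubectl apply -f "$manifest"
```

A pod like this one, with a request lower than its limit, falls into the Burstable QoS class. A pod is classed as Guaranteed only when every container sets both CPU and memory requests equal to their limits.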

Check out also how to fix CreateContainerConfigError and ImagePullBackOff error in Kubernetes.

Preventing OOMKilled Errors

There are a couple of ways in which you can prevent OOMKilled errors:

  1. Set appropriate memory limits: The maximum amount of memory a container is allowed to use shouldn’t be lower than your workload’s typical memory consumption. To determine that, use metrics and monitoring to measure the typical memory usage of your application. Overestimating can lead to higher costs due to inefficient resource utilization (in this case, you must expand your nodes), but underestimating leads to frequent OOMKilled errors.
  2. Horizontal pod autoscaling: For applications that can be scaled horizontally, it is best practice to use the Kubernetes Horizontal Pod Autoscaler (HPA) to automatically increase the number of pod replicas when memory demand is high.
  3. Node resource allocation: Ensure your node has enough resources to handle workloads.
  4. Optimize application memory usage: Monitor your application and refactor it, if possible, to reduce memory consumption.
  5. Avoid memory leaks in your application: On the application side, you should regularly check and fix memory leaks.
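
For item 2, a memory-based autoscaler can be declared with the autoscaling/v2 API. The sketch below assumes a Deployment named example-app (hypothetical) whose containers set memory requests, since utilization is measured relative to the request, and a cluster with a metrics source such as metrics-server installed. The replica bounds and the 75% target are placeholders.

```shell
# Write an example HorizontalPodAutoscaler manifest that scales on memory.
# The target name, replica bounds, and utilization are hypothetical.
hpa_manifest=$(mktemp)
cat > "$hpa_manifest" <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75   # percent of the memory request
EOF

echo "HPA manifest written to $hpa_manifest"
# Apply it with: kubectl apply -f "$hpa_manifest"
```

Adding replicas only helps when memory use grows with load. It does not fix a memory leak, where each replica will still eventually reach its limit.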

Key Points

To avoid the OOMKilled error, it is recommended to monitor memory usage in Kubernetes pods and containers, set resource limits to prevent containers from consuming too much memory, and optimize application code to reduce memory consumption.

Additionally, consider increasing the memory resources allocated to the pod or using horizontal pod autoscaling to scale up the number of pods in response to increased workload demands.

We encourage you to also check out how Spacelift helps you manage the complexities and compliance challenges of using Kubernetes. Anything that can be run via kubectl can be run within a Spacelift stack. Find out more about how Spacelift works with Kubernetes, and get started on your journey by creating a free trial account.
