15 Common Kubernetes Pitfalls & Challenges

Kubernetes is the most popular orchestrator for container deployment and management. It equips you with powerful tools to reliably run containerized apps in production.

With flexibility comes complexity, however. Kubernetes includes its own concepts, terms, and object types for modeling your application. Choosing when to use different components can be confusing to newcomers and experienced users alike because the effects of your decisions aren’t always easy to anticipate.

In this article, we’ll explore 15 common Kubernetes pitfalls which many teams encounter. Being able to recognize and avoid these challenges will improve your app’s scalability, reliability, and security while giving you more control over your cluster and its deployments.

  1. Deploying Containers With the “Latest” Tag
  2. Not Using Liveness and Readiness Probes
  3. Broken Pod Affinity/Anti-Affinity Rules
  4. Forgetting Network Policies
  5. No Monitoring/Logging
  6. Label Selector Mismatches
  7. Service Port Mismatches
  8. Using Multiple Load Balancers
  9. Accidentally Deploying to the Wrong Namespace
  10. Pods Without Resource Requests and Limits
  11. Not Budgeting For Failure With PodDisruptionBudgets
  12. Incorrect Cluster Size and Faulty Auto-Scaling
  13. Inefficient Scheduling Due to Missing Node Selectors
  14. Relying on the Standard Tools
  15. Not Using Pod Security Admission Standards

1. Deploying Containers With the "Latest" Tag

Arguably one of the most frequently violated Kubernetes best practices is to avoid the latest tag when you deploy containers. Using latest puts you at risk of unintentionally receiving major changes which could break your deployments.

The latest tag is used in different ways by individual authors, but most will point latest to the newest release of their project. Using helm:latest today will deliver Helm v3, for example, but it’ll immediately update to v4 after that release is launched.

When you use latest, the actual versions of the images in your cluster are unpredictable and subject to change. By default, Kubernetes will always pull the image when a new Pod is started, even if a version is already available on the host Node. This differs from other tags, where the existing image on the Node will be reused by default when it exists.
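
Pinning an explicit version avoids this. Here’s a minimal sketch of a Pod that does so; the Pod name and the nginx:1.25.3 tag are illustrative assumptions, so substitute a release you have actually tested:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-image-demo # hypothetical name
spec:
  containers:
    - name: web
      # Explicit version instead of "latest" (tag chosen for illustration only)
      image: nginx:1.25.3
      # Reuse the image when it is already cached on the Node
      imagePullPolicy: IfNotPresent
```

For stricter guarantees, you can reference the image by its digest (image@sha256:...) so the content can’t change even if the tag is moved.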

2. Not Using Liveness and Readiness Probes

Probes make your applications more resilient. They inform Kubernetes of the health of your Pods.

Liveness probes tell Kubernetes when it should restart a container because a problem has occurred. They allow malfunctioning containers to be replaced in circumstances where they’re broken but haven’t stopped themselves. Readiness probes indicate when a container is ready to begin accepting traffic from a service, preventing failures that occur during application startup.

Probes are easy to configure, but they’re often forgotten.

Here’s a simple Pod with both liveness and readiness probes:

apiVersion: v1
kind: Pod
metadata:
  name: probes-demo
spec:
  containers:
    - name: probes-demo
      image: nginx:latest
      livenessProbe:
        httpGet:
          path: /
          port: 80
      readinessProbe:
        httpGet:
          path: /
          port: 80

Several different probe types are supported, including HTTP (shown here), TCP, gRPC, and command execution.
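
As a rough sketch of the other probe types, the following Pod uses a TCP liveness probe and a command-based readiness probe. The Pod name, the timing values, and the file checked by the command are assumptions chosen for illustration, not recommended settings:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probes-tuning-demo # hypothetical name
spec:
  containers:
    - name: probes-tuning-demo
      image: nginx:latest
      livenessProbe:
        # Succeeds when a TCP connection to port 80 can be opened
        tcpSocket:
          port: 80
        initialDelaySeconds: 10 # wait before the first check
        periodSeconds: 5 # run the check every 5 seconds
        failureThreshold: 3 # restart after 3 consecutive failures
      readinessProbe:
        # Succeeds when the command exits with status 0
        exec:
          command: ["cat", "/usr/share/nginx/html/index.html"]
        periodSeconds: 5
```

Tune the timing fields to your application’s real startup and response times; values that are too aggressive can cause unnecessary restarts.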

3. Broken Pod Affinity/Anti-Affinity Rules

Pod affinity and anti-affinity rules allow you to instruct Kubernetes which Node is the best match for new Pods. Rules can be conditioned on Node-level characteristics such as labels, or characteristics of the other Pods already running on the Node.

Affinity rules attract Pods to Nodes, making it more likely that a Pod will schedule to a particular Node, whereas anti-affinity has a repelling effect which reduces the probability of scheduling. Kubernetes evaluates the Pod’s affinity rules for each of the possible Nodes that could be used for scheduling, then selects the most suitable one.

The affinity system is capable of supporting complex scheduling behavior, but it’s also easy to misconfigure affinity rules. When this happens, Pods will unexpectedly schedule to incorrect Nodes, or refuse to schedule at all. Inspect affinity rules for contradictions and impossible selectors, such as labels which no Nodes possess.
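
To make this concrete, here’s a hedged sketch that combines a required Node affinity rule with a preferred Pod anti-affinity rule. The disktype: ssd Node label and the Pod name are assumptions for illustration; if no Node carries that label, the Pod will stay Pending, which is exactly the kind of impossible selector to watch for:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo # hypothetical name
  labels:
    app: demo-app
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only Nodes labeled disktype=ssd are eligible
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype # assumed Node label
                operator: In
                values: ["ssd"]
    podAntiAffinity:
      # Soft preference: avoid Nodes already running app=demo-app Pods
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: demo-app
            topologyKey: kubernetes.io/hostname
  containers:
    - name: nginx
      image: nginx:latest
```

Using a preferred rule for the anti-affinity keeps the Pod schedulable even when every eligible Node already runs a replica.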

4. Forgetting Network Policies

Network policies control the permissible traffic flows to Pods in your cluster. Each NetworkPolicy object targets a set of Pods and defines the IP address ranges, Kubernetes namespaces, and other Pods that the set can communicate with.

Pods that aren’t covered by a policy have no networking restrictions imposed. This is a security issue because it unnecessarily increases your attack surface. A compromised neighboring container could direct malicious traffic to sensitive Pods without being subject to any filtering.

Including all Pods in at least one NetworkPolicy is a simple but effective layer of extra protection. Policies are easy to create, too – here’s an example where only Pods labeled app-component: api can communicate with those labeled app-component: database:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-policy
spec:
  podSelector:
    matchLabels:
      app-component: database
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app-component: api
  egress:
    - to:
        - podSelector:
            matchLabels:
              app-component: api
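
One common way to make sure every Pod is covered is a default-deny policy for the namespace. This is a minimal sketch, assuming a namespace called demo-app; the empty podSelector matches all Pods in that namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all # hypothetical name
  namespace: demo-app # assumed namespace
spec:
  # An empty selector targets every Pod in the namespace
  podSelector: {}
  # No ingress or egress rules are listed, so all traffic is denied
  policyTypes:
    - Ingress
    - Egress
```

More specific policies, such as the database example above, then allow only the flows you need. Note that denying all egress also blocks DNS lookups until you explicitly allow them, and that policies are only enforced when your cluster’s network plugin supports NetworkPolicy.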

5. No Monitoring/Logging

Accurate visibility into cluster utilization, application errors, and real-time performance data is essential as you scale your apps in Kubernetes. Spiking memory consumption, Pod evictions, and container crashes are all problems you should know about, but standard Kubernetes doesn’t come with any observability features to alert you when problems occur.

To enable monitoring for your cluster, you should deploy an observability stack such as Prometheus. This collects metrics from Kubernetes, ready for you to query and visualize on dashboards. It includes an alerting system to notify you of important events.

Kubernetes without good observability can create a false sense of security. You won’t know what’s working or be able to detect emerging faults. Failures will be harder to resolve without easy access to the logs that preceded them.
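
As an illustration of the alerting side, here’s a sketch of a Prometheus rule file that fires when a container restarts repeatedly. It assumes kube-state-metrics is installed to expose the kube_pod_container_status_restarts_total metric; the group name, alert name, and thresholds are arbitrary examples:

```yaml
groups:
  - name: demo-kubernetes-alerts # hypothetical group name
    rules:
      - alert: ContainerRestartingFrequently # hypothetical alert name
        # More than 3 restarts within 15 minutes (threshold is an assumption)
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in Pod {{ $labels.pod }} is restarting frequently"
```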

Read more about fixing CreateContainerConfigError and OOMKilled errors.

6. Label Selector Mismatches

Objects such as Deployments and Services rely on correct label selectors to identify the Pods and other objects they manage. Mismatches between selectors and the labels actually assigned to your objects will cause your deployment to fail.

The following example demonstrates this problem:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        # Label does not match the deployment's selector!
        app: demo-application
    spec:
      containers:
        - name: demo-app
          image: nginx:latest

When this happens, Kubectl will reject the manifest with a “selector does not match template labels” error. To fix the problem, adjust your manifest’s spec.selector.matchLabels and spec.template.metadata.labels fields so they have the same key-value pairs.
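
A corrected version of the manifest might look like the following sketch, where the selector and the template labels share the same app: demo-app pair:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        # Matches spec.selector.matchLabels exactly
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: nginx:latest
```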

7. Service Port Mismatches

Similarly, it’s important to make sure your services route traffic to the correct port on your Pods. Incorrect service port definitions can make it look like a Pod has failed, when in fact your traffic simply isn’t reaching it.

The following manifest contains an example of this problem. The service listens on port 9000 and forwards traffic to port 8080 on its Pods, but the container actually expects traffic to hit port 80:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  labels:
    app: demo-app
spec:
  containers:
    - name: demo-app
      image: nginx:latest
      ports:
        - containerPort: 80

---

apiVersion: v1
kind: Service
metadata:
  name: demo-service
spec:
  ports:
    - port: 9000
      protocol: TCP
      targetPort: 8080
  selector:
    app: demo-app
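
One way to fix the mismatch, sketched below, is to point the service’s targetPort at the port the container actually listens on. Keeping port 9000 as the service’s own port is an assumption; only targetPort needs to change:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-service
spec:
  ports:
    - port: 9000 # port exposed by the service
      protocol: TCP
      targetPort: 80 # must match the containerPort on the Pod
  selector:
    app: demo-app
```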

8. Using Multiple Load Balancers

Running multiple LoadBalancer services in your cluster can be useful but is often unintentionally wasteful. Each LoadBalancer service you create will provision a new load balancer and external IP address from your cloud provider, increasing your costs.

Ingress is a better way to publicly expose multiple services using HTTP routes. Installing an Ingress controller such as Ingress-NGINX lets you direct traffic between your services based on characteristics of incoming HTTP requests, such as URL and hostname.

With Ingress, you can use a single load balancer to serve all your applications. Only add another load balancer when your application requires an additional external IP address, or manual control over routing behavior.
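
Here’s a hedged sketch of an Ingress that routes two hostnames to two different services through one controller. The hostnames, service names, and the nginx ingress class are assumptions for illustration and depend on how your controller is installed:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-ingress # hypothetical name
spec:
  ingressClassName: nginx # assumes an Ingress-NGINX controller with this class
  rules:
    - host: app.example.com # assumed hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service # hypothetical ClusterIP service
                port:
                  number: 80
    - host: api.example.com # assumed hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service # hypothetical ClusterIP service
                port:
                  number: 80
```

Both backends can be plain ClusterIP services, so only the Ingress controller itself needs a load balancer.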

Read more about Kubernetes load balancers.

9. Accidentally Deploying to the Wrong Namespace

Kubernetes namespaces logically group objects together, providing a degree of isolation in your cluster. Creating a namespace for each team, app, and environment prevents name collisions and simplifies the management experience.

When using namespaces, remember to specify the target namespace for each of your objects and Kubectl commands. Otherwise, the default namespace will be used. This can be a debugging headache if objects don’t appear where you expected them.

Set the metadata.namespace field on all your namespaced objects so they’re added to the correct namespace:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  namespace: demo-app
spec:
  # ...

Include the -n or --namespace flag with your Kubectl commands to scope an operation to a namespace:

# Get the Pods in the demo-app namespace
$ kubectl get pods -n demo-app

This flag is also supported by Kubernetes ecosystem tools such as Helm. For a simpler namespace-switching experience, try kubens to quickly change namespaces and persist your selection between consecutive commands.

10. Pods Without Resource Requests and Limits

Correct resource management is essential to preserve your cluster’s stability. Pods don’t apply any resource limits unless you configure them, which can permit CPU and memory exhaustion to occur.

Set proper resource requests and limits on all your Pods to reduce resource contention. A request instructs Kubernetes to reserve a particular amount of a resource for your Pod, preventing it from scheduling onto Nodes that can’t provide enough capacity. Limits set the maximum amount of the resource which the Pod can use; Pods that exceed a CPU limit will be throttled, while reaching a memory limit prompts the out-of-memory (OOM) killer to terminate the process running in the Pod.

Requests and limits are defined in the spec.containers[].resources field of a Pod’s manifest:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container
      image: nginx:latest
      resources:
        requests:
          cpu: 100m
          memory: 1Gi
        limits:
          memory: 1Gi

This Pod requests 100m (100 millicores) of CPU time and 1Gi of memory. It will only schedule onto Nodes that can provide sufficient resources. The Pod also has a memory limit set, which prevents it from using more than the requested 1Gi. It is best practice to set a Pod’s memory limit equal to its request. CPU limits aren’t usually required because, when CPU is contended, Kubernetes shares the available CPU time between Pods in proportion to their requests.
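
If you want a safety net for Pods that omit these fields, a LimitRange can apply namespace-wide defaults. The following is a minimal sketch; the object name, the demo-app namespace, and the values are assumptions, not recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: demo-defaults # hypothetical name
  namespace: demo-app # assumed namespace
spec:
  limits:
    - type: Container
      # Applied as requests when a container doesn't set its own
      defaultRequest:
        cpu: 100m
        memory: 256Mi
      # Applied as limits when a container doesn't set its own
      default:
        memory: 256Mi
```

Explicit values in a Pod’s manifest still take precedence over these defaults.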

11. Not Budgeting For Failure With PodDisruptionBudgets

Pod disruption budgets inform Kubernetes how much disruption your app can tolerate. They’re used during periods of restricted cluster availability, such as when Nodes are offline for an upgrade.

Disruption budgets tell Kubernetes that a specified number of Pods must be kept available when disruption occurs. The following PodDisruptionBudget object preserves at least two replicas of Pods with the app: demo-app label:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-pdp
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: demo-app

Alternatively, you can specify the maximum number of Pods that Kubernetes can evict when disruption is encountered:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-pdp
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: demo-app

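Using this mechanism makes your application more resilient during periods of reduced cluster capacity. It helps maintain a minimum service level during voluntary disruptions, such as Node drains, because evictions that would violate the budget are blocked.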

12. Incorrect Cluster Size and Faulty Auto-Scaling

Kubernetes is often seen as a route to easy scalability. A correctly configured cluster lets you dynamically scale both horizontally and vertically by automatically adding new Pods and Nodes when demand spikes. Unfortunately, many teams scale their clusters incorrectly or find their auto-scaling is unpredictable.

Regularly review your cluster’s utilization to check whether it’s still suitable for your workloads. Test autoscaling rules by using a load-testing tool like Locust to direct excess traffic to your cluster. This lets you spot problems earlier, ensuring your Pods will scale seamlessly when real traffic arrives.
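
For reference, a basic autoscaling rule to put under load might look like this sketch. It assumes a Deployment named demo-deployment, CPU requests set on its Pods, and a metrics source such as Metrics Server in the cluster; the replica counts and the 70% target are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-hpa # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-deployment # assumed target Deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          # Add replicas when average CPU use exceeds 70% of the requested CPU
          type: Utilization
          averageUtilization: 70
```

Because utilization is measured against each Pod’s CPU request, missing or unrealistic requests are a frequent cause of unpredictable scaling.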

13. Inefficient Scheduling Due to Missing Node Selectors

Overall cluster performance depends on Pods being correctly scheduled to suitable Nodes. Many clusters combine several types of Node, such as small 2 CPU/4 GB machines for standard applications and larger 8 CPU/16 GB Nodes for intensive backend services.

Cluster utilization will be inefficient if your Pods don’t reliably schedule to the Node pool you’d intended. This can increase your cluster’s costs by forcing new larger Nodes to be created unnecessarily, even though underused smaller ones are available. Avoid this problem by setting labels on your Nodes, then using node selectors to assign each Pod to a compatible Node:

apiVersion: v1
kind: Pod
metadata:
  name: pod-node-selector-demo
spec:
  containers:
    - name: nginx
      image: nginx:latest
  nodeSelector:
    node-class: 2vcpu4gb

This Pod will only schedule to Nodes that have the node-class: 2vcpu4gb label set.

Use the kubectl label command to set labels on your matching Nodes:

$ kubectl get nodes
NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   10d   v1.26.1

$ kubectl label node minikube node-class=2vcpu4gb
node/minikube labeled

Setting proper scheduling constraints will maximize Node usage and maintain stable cluster performance.

14. Relying on the Standard Tools

Standard tools, including Kubectl and the Kubernetes API, become inefficient when you’re managing larger clusters. Automating your infrastructure with an IaC solution lets you reliably provision clusters and collaborate on changes.

Avoid manually administering Kubernetes by integrating cluster operations into your CI/CD and GitOps workflows using a management platform such as Spacelift. Spacelift helps you reduce Kubernetes complexity, synchronize changes between environments, and enforce compliance policies. It works with other infrastructure components, including Terraform and Pulumi, too.

15. Not Using Pod Security Admission Standards

Pod security admission standards allow you to enforce security best practices for your cluster. Admission controllers are able to reject Pods which don’t meet specified security criteria, such as when privileged capabilities or direct host port bindings are used.

Kubernetes ships with three different security standards: Privileged, Baseline, and Restricted. The Restricted policy gives you the best protection by enforcing that all Pods adhere to current hardening best practices. Baseline is suitable for less critical scenarios, while Privileged removes the restrictions to support workloads that require privilege escalation.
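
Standards are applied per namespace using labels. Here’s a minimal sketch that enforces the Restricted standard; the demo-app namespace name is an assumption:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-app # assumed namespace
  labels:
    # Reject Pods that violate the Restricted standard
    pod-security.kubernetes.io/enforce: restricted
    # Also surface violations as client warnings and audit log entries
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

When introducing the standards to an existing namespace, you could start with only the warn and audit labels to see which workloads would be rejected before you switch on enforce.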

Read more about Container security best practices and solutions.

Key Points

Kubernetes is the industry-standard orchestrator for cloud-native systems, but popularity doesn’t mean perfection. To get the most from Kubernetes, your developers and operators need to correctly configure your cluster and its objects to avoid errors, sub-par scaling, and security vulnerabilities.

This guide has covered 15 challenges to look for each time you use Kubernetes. While avoiding them will prevent the most commonly encountered issues, you should also review Kubernetes best practices to get even more out of your cluster, and check out Kubernetes use cases.

When operating Kubernetes becomes too demanding, try using an IaC platform to provision and manage your clusters. Spacelift is a collaborative solution for visualizing your infrastructure, enforcing policies, and preventing drift.
