Kubernetes metrics enable you to measure your cluster’s health and performance with values such as Node CPU utilization, Pod restart counts, and API server latency. Monitoring these values is a crucial component in any Kubernetes observability strategy.
In this article, we explain the main categories of Kubernetes metrics. We provide example metrics, discuss what they reveal, and share the top tools and best practices to implement. You’ll then be ready to build an effective Kubernetes monitoring system that helps you improve your cluster’s reliability.
Kubernetes metrics are quantitative data points that provide insight into the performance, health, and resource usage of a Kubernetes cluster. They are discrete numerical values that expose a system’s activity.
For instance, your cluster may have 20% CPU utilization, 5 Pods in a Pending state, or an average Pod scheduling time of 2 seconds. These are all examples of metrics and their values.
Metric readings are specific to a particular point in time. To query how metrics change over time, you must use tools like Prometheus to regularly collect your metrics and store the readings in a centralized database.
Why is monitoring Kubernetes so important?
Regularly monitoring Kubernetes metrics reveals what’s happening in your cluster. Monitoring helps you spot performance regressions, detect stability issues, and understand capacity and usage.
Developers might watch Pod failures to catch crashes, whereas SREs and leadership teams track service metrics such as request latency and error rates to measure SLO compliance.
Kubernetes does not ship a full metrics stack by itself. Its components expose metrics and it defines the metrics APIs, and many managed platforms make it easy to enable Metrics Server, which serves basic resource metrics (CPU and memory) through the Kubernetes Metrics API. Metrics Server is intentionally minimal and is not a long-term storage solution.
Before running production clusters, you should deploy a robust solution to collect, store, and analyze the metrics that matter most. Common setups include Prometheus for scraping, kube-state-metrics for object state, and a time-series backend with dashboards.
Because metrics tell you what is happening, not why it is happening, you should also collect logs and traces (for example, using OpenTelemetry) to achieve complete visibility. It is also useful to review Kubernetes Events for operational context.
Now that we’ve covered the basics of what metrics are, let’s take a closer look at some of the main types you’ll find in a Kubernetes cluster.
The key Kubernetes metrics include:
- Cluster metrics
- Node metrics
- Pod, container, and workload metrics
- Network metrics
- Storage metrics
- Application metrics
Each family of metrics provides insights into a different area of your cluster and its workloads. Instrumenting your cluster with coverage for all six types allows you to monitor every part of your operations. We’ll discuss some of the tools you can use later in this article.
1. Cluster metrics
Cluster-level metrics allow you to measure the performance of your cluster’s control plane. Components to monitor in this group include the Kubernetes API server, controller manager, etcd, and scheduler. Each component provides detailed metrics in Prometheus format from its /metrics HTTP endpoint.
Some of the key cluster metrics to monitor include:
- API server error rates (4xx and 5xx response codes)
- API server response times and sizes
- Average API server request latency
- Average scheduling delay for new Pods
- Number of scheduling attempts required for new Pods
- Number of Pods currently pending
- Controller reconciliation failure rates
- etcd read/write latency
- etcd leader election events
These values provide insights into your cluster’s performance and stability. They enable you to detect emerging problems that could threaten your ability to manage your cluster or operate your workloads.
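If you collect these metrics with Prometheus, you can alert on them directly. As a minimal sketch, assuming the Prometheus Operator's PrometheusRule CRD is available (for example, via Kube-Prometheus-Stack) and using an illustrative 5% threshold, an API server error-rate alert might look like this:

```yaml
# Hypothetical alert rule: fires when more than 5% of API server requests
# return a 5xx response over the last five minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apiserver-error-rate     # illustrative name
  namespace: monitoring          # adjust to your monitoring namespace
spec:
  groups:
    - name: control-plane.rules
      rules:
        - alert: APIServerHighErrorRate
          expr: |
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
              /
            sum(rate(apiserver_request_total[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "More than 5% of API server requests are failing"
```

The exact thresholds and durations should reflect your own SLOs rather than the placeholder values used here.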
2. Node metrics
Node-level metrics describe the performance of your cluster’s Nodes. These are the compute instances that run your Pods.
Metrics in this category primarily relate to each Node's resource consumption, but it's also important to monitor the Kubelet process that runs on each Node. The Kubelet handles communication between the Node and the control plane, so problems with it can stop new Pods from starting or prevent changes from being applied on that Node.
Common Node-level Kubernetes metrics include:
- CPU usage
- Memory usage
- Pod count
- Disk space usage
- Storage bandwidth saturation
- Network bandwidth saturation
- Desired vs running Pod count
- Kubelet operation rate
- Kubelet operation error rate
- Kubelet CPU and memory usage
- Time taken to start new Pods
Because these metrics need to be scraped from each Node in your cluster, monitoring tools typically deploy a metrics collection agent as a Kubernetes DaemonSet. DaemonSets automatically replicate a Pod across all the Nodes in your cluster. This ensures complete monitoring coverage with minimal configuration.
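For example, an agent such as Prometheus Node Exporter is usually deployed as a DaemonSet. Here is a minimal sketch; the image tag and labels are illustrative, and production setups (such as the community Helm charts) add host mounts, tolerations, and security settings:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          # Pin to the version you have vetted for your cluster
          image: quay.io/prometheus/node-exporter:v1.8.1
          ports:
            - containerPort: 9100   # default Node Exporter metrics port
              name: metrics
```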
3. Pod, container, and workload metrics
Pod and container-level metrics allow you to analyze resource consumption, health, and reliability at the workload level. These metrics provide crucial data on what’s happening in your workloads, so you can keep your apps running smoothly.
Here are some of the key metrics in this category:
- Pod-level CPU and memory consumption
- Container-level CPU and memory consumption
- Liveness, readiness, and startup probe results
- Number of times that Pods have been marked unhealthy (due to a failing liveness probe)
- Time taken for new Pods to become ready (pass their first readiness probe)
- Number of Pods in an error state, such as CrashLoopBackOff
- Pod restart counts
These metrics provide in-depth insights into the health of your workload. They’re the first values to check when you’re having problems with a particular application. The metrics reveal whether your Pods are running normally and if they have enough free resources to perform as expected.
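If Metrics Server is installed (covered later in this article), you can check the most important of these values straight from the command line. The namespace below is illustrative:

```shell
# Per-container CPU and memory usage for Pods in a namespace
kubectl top pod --containers -n my-app

# Pod status and restart counts at a glance
kubectl get pods -n my-app
```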
4. Network metrics
Beyond Node-level network bandwidth metrics, granular network monitoring systems allow you to precisely track problems such as latency, packet loss, and incorrect routing.
Some frequently tracked values include:
- Overall packet loss
- Latency between services
- Average bandwidth usage
- Total network throughput
- Packet receive and transmit rates
- Ingress requests served per second
- Ingress error rates
- Ingress bandwidth usage
- Ingress requests per service
- Ingress requests served without a valid certificate
Network-level metrics are typically provided by your cluster’s CNI (Container Network Interface) networking plugin. Ingress metrics are reported by popular Ingress controllers such as Ingress-NGINX and Traefik.
5. Storage metrics
Storage metrics monitor Kubernetes Persistent Volumes. Tracking volume capacity and utilization helps ensure your apps always have access to the storage they need.
Specific metrics to monitor include:
- The number of volumes in your cluster
- The number of active storage classes and access modes
- Volume space usage
- Volume inodes usage
- Disk activity and utilization
These metrics provide the data needed to reliably operate stateful apps in your cluster. Kubernetes storage works differently from traditional models, so you need clear visibility into the status of each volume you create.
6. Application metrics
Application metrics are the final piece of a Kubernetes monitoring stack. Although not strictly tied to Kubernetes itself, these metrics are often evaluated alongside the other families discussed above. They’re the bespoke metrics you create for your workloads to track your own KPIs.
Because application metrics are unique to your environment, they can vary considerably between projects and teams. Here are a few suggestions of values you might track:
- Application error rates
- Application uptime
- Successful user sign-ups
- The number of transactions processed
- Third-party integration request durations
- Job and message queue lengths
You can expose metrics from your apps using the libraries provided by observability suites. For instance, if you choose Prometheus or Datadog for your cluster-level monitoring, then you could instrument your app with their respective client libraries. Standardizing on a single platform for all types of metrics delivers unified visibility into your cluster and its workloads.
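Once your app exposes a /metrics endpoint via a client library, you still need to tell your monitoring system to scrape it. If you run the Prometheus Operator (for example, as part of Kube-Prometheus-Stack), a ServiceMonitor is one way to do that. The names and port below are illustrative, and your Prometheus instance must be configured to select this ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # illustrative name
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app              # must match your Service's labels
  endpoints:
    - port: http-metrics       # named Service port that serves /metrics
      path: /metrics
      interval: 30s
```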
As we mentioned earlier, Kubernetes exports some cluster-level metrics from its control plane components. However, it doesn’t include a system to easily collect, monitor, or analyze the data, nor does it automatically provide Node, Pod, storage, or network-level metrics.
Fortunately, there’s a great ecosystem of tools and platforms you can use to easily manage your Kubernetes metrics:
- Metrics-Server: A Kubernetes project that provides basic Node and Pod-level resource utilization data. You can access live metrics using the kubectl top command.
- Kube-State-Metrics: This tool is maintained within the official Kubernetes repositories. It provides detailed metrics in Prometheus format that describe the state of the objects in your cluster, such as the number of running Pods and PVC utilization.
- cAdvisor: cAdvisor is an open-source container monitoring tool developed by Google. It provides metrics about running containers, including memory usage, CPU consumption, and filesystem and network activity. cAdvisor is built into Kubelet, so you can access cAdvisor metrics from Kubelet’s metrics endpoint (see the example after this list).
- Prometheus: Prometheus is a leading open-source time-series database that’s useful for storing and querying all types of metrics. You can use it to scrape metrics from Kubernetes control plane components, Nodes (via the Node-Exporter agent), and your own applications. Prometheus is easy to operate in Kubernetes clusters using the Kube-Prometheus-Stack community project (see below).
- Grafana: Grafana is a centralized observability solution that lets you build visual dashboards from your data. It works with data sources including Prometheus.
- Kube-Prometheus-Stack: A popular community-managed project, this Helm chart automates the installation and configuration of Prometheus and Grafana in your cluster.
- Datadog: Datadog is a popular observability platform that collects and stores Kubernetes metrics and logs. Deploying the Datadog Agent in your cluster automates the collection of metrics data from your environment.
- Amazon CloudWatch and Container Insights: These AWS-specific monitoring solutions provide deep visibility into Amazon EKS clusters and associated resources. You can analyze the metrics within your AWS account.
- Google Cloud Monitoring: Google’s cloud observability solution lets you monitor the health and performance of GKE clusters. You can enable cluster, Node, and workload metrics as configurable packages.
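To see the raw cAdvisor data mentioned above, you can query the Kubelet's metrics endpoint through the API server proxy. For example, substituting a real Node name from your cluster:

```shell
# Container-level metrics from the Kubelet's built-in cAdvisor endpoint
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor"
```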
If you’re looking for an all-in-one solution, Kube-Prometheus-Stack is typically the easiest option for getting started. The single Helm chart simplifies the installation of Prometheus and Grafana in your cluster.
Moreover, the chart automatically configures Prometheus to scrape Kubernetes control plane, Node, and workload-level (kube-state-metrics) data. You can jump straight into analyzing your metrics without having to install extra components or manually prepare scrape settings.
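Installation typically takes only a couple of commands; the release name and namespace below are illustrative:

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus, Grafana, and the default scrape configuration
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```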
Let’s wrap up with a look at the five best practices for collecting and using metrics in Kubernetes. These tips contribute to a scalable, resilient monitoring system that’s ready to support your day-to-day cluster operations.
1. Prioritize metrics that are tied to business outcomes
The most useful metrics are those directly linked to your operations. Prioritize collecting and monitoring metrics that indicate whether you’re meeting your KPIs. Network latency, error rates, and Pod failure counts can be helpful indicators of compliance with uptime-based SLOs, for example.
2. Consistently label your Kubernetes resources
Metrics allow you to perform detailed analysis of the activity in your Kubernetes clusters. However, this is only possible if you consistently label your objects with meaningful metadata.
For instance, placing labels such as example.com/team, example.com/project, and example.com/environment on each resource allows you to accurately compare metrics from different deployments.
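As a sketch, a Deployment carrying these labels (the key prefixes, names, and image are illustrative) can be filtered and grouped consistently across dashboards and alerts. Repeating the labels on the Pod template ensures that Pod-level metrics carry them too:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api                       # illustrative workload
  labels:
    example.com/team: payments
    example.com/project: storefront
    example.com/environment: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        example.com/team: payments          # repeated so Pod-level metrics
        example.com/project: storefront     # inherit the same metadata
        example.com/environment: production
    spec:
      containers:
        - name: checkout-api
          image: registry.example.com/checkout-api:1.0.0  # illustrative image
```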
3. Only monitor relevant, actionable metrics
Tools like Kube-State-Metrics and Kubelet’s cAdvisor integration generate a huge number of metrics. In practice, not all available metrics are useful on a day-to-day basis. It’s best to focus on the few metrics that you will act upon when they change. Creating custom dashboards and alerts for the most relevant information helps prevent distraction, confusion, and information overload.
4. Correlate metrics changes to system activity with logs and traces
Metrics are just one of the three pillars of observability. Collecting logs and traces alongside your metrics allows you to match changes back to specific events in your system.
A cluster that’s instrumented for all three data types allows you to pinpoint issues such as a CPU spike, then jump into logs to investigate what was happening at the time. Afterwards, you can review your traces to investigate how the problematic request passed through your infrastructure.
5. Use your metrics to automate your cluster operations
Finally, if you’re only manually monitoring metrics, you’re missing out on an opportunity to simplify your operations. Configuring automated alerts is a first step towards greater automation, but metrics data also unlocks powerful workload auto-scaling options.
Horizontal Pod Autoscaling (HPA) dynamically adds and removes Pods based on live metrics, while Vertical Pod Autoscaling (VPA) adjusts the resource requests of existing Pods. Using these mechanisms frees your team from manually scaling workloads.
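As a minimal sketch (the workload name and thresholds are illustrative), an HPA that scales a Deployment on CPU utilization reported through the Metrics API looks like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api           # the workload to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above ~70% of requested CPU
```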
If you need help managing your Kubernetes projects, consider Spacelift. It brings a GitOps flow to your Kubernetes operations, so your Kubernetes Deployments are synced with your Kubernetes Stacks, and pull requests show you a preview of the changes they will apply.
With Spacelift, you get:
- Policies to control what kind of resources engineers can create, what parameters they can have, how many approvals you need for a run, what kind of task you execute, what happens when a pull request is open, and where to send your notifications
- Stack dependencies to build multi-infrastructure automation workflows, combining Terraform with Kubernetes, Ansible, and other infrastructure-as-code (IaC) tools such as OpenTofu, Pulumi, and CloudFormation
- Self-service infrastructure via Blueprints, enabling your developers to focus on what matters (developing application code) without sacrificing control
- Creature comforts such as contexts (reusable containers for your environment variables, files, and hooks), and the ability to run arbitrary code
- Drift detection and optional remediation
If you want to learn more about Spacelift, create a free account today or book a demo with one of our engineers.
Metrics tell you what’s going on in your Kubernetes clusters. Monitoring metrics such as CPU usage, Pod scheduling delays, and network latency lets you track stability and performance trends. It enables you to make data-driven improvements, so your workloads run more reliably.
In this article, we’ve explored the six main types of Kubernetes metrics along with the tools you can use to collect them. Achieving full cluster visibility depends on good coverage in every category. Beyond metrics, it’s also vital to implement robust systems for collecting Kubernetes logs and traces. Metrics tell you what’s happening, but with logs and traces, you can also explain why changes have occurred.
Ready to learn more about Kubernetes monitoring? Check out our Kubernetes observability guide to find more tools, tips, and best practices for different use cases.
Manage Kubernetes easier and faster
Spacelift allows you to automate, audit, secure, and continuously deliver your infrastructure. It helps overcome common state management issues and adds several must-have features for infrastructure management.
Frequently asked questions
What is the difference between Kubernetes logs and metrics?
Kubernetes logs capture detailed event data and messages from containers, pods, or system components, while metrics provide numerical measurements of system performance over time.
How does the Kubernetes Metrics Server work?
The Kubernetes Metrics Server collects resource usage data from each node and pod, aggregating CPU and memory metrics through the Kubelet’s Summary API. It serves as the central source for resource metrics that are used by the Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), and kubectl top commands.
When a metrics request is made, the Metrics Server queries each Kubelet over HTTPS, retrieves summarized statistics from cAdvisor, and then stores them temporarily in memory. It does not persist data long-term, focusing instead on providing current cluster performance snapshots.
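You can see the data Metrics Server serves by querying the Metrics API directly; the namespace below is illustrative:

```shell
# Current CPU and memory usage for every Node, as reported by Metrics Server
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# The same data for Pods in a specific namespace
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/my-app/pods
```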
Can you view Kubernetes metrics without Prometheus?
Yes. You can view Kubernetes metrics without Prometheus using the Metrics Server, kubelet and cAdvisor endpoints, the Kubernetes Dashboard, or external backends like cloud monitoring services.
