Kubernetes automates container management tasks so you can efficiently deploy and scale your workloads. It can distribute your containers across clusters of hundreds or thousands of Nodes.
Most developers begin using Kubernetes for stateless apps. A stateless system doesn’t modify its environment or write any persistent data. These components are easy to deploy to Kubernetes because their container instances are interchangeable.
Kubernetes can also be used for stateful systems, though. This could be a database, a backend that writes files to persistent volumes, or a service where one replica is elected the leader to gain control of its neighbors.
In this article, you’ll learn how to use StatefulSet objects to reliably manage state in your cluster.
StatefulSets are used to manage stateful applications that require persistent storage, stable unique network identifiers, and ordered deployment and scaling. They are very useful for databases and data stores that require persistent storage or for distributed systems and consensus-based applications such as etcd and ZooKeeper.
A StatefulSet’s YAML manifest defines a template for its Pods. Kubernetes automatically creates, replaces, and deletes Pods as you scale the StatefulSet, while preserving any previously assigned identities.
StatefulSets provide several advantages over the ReplicaSet and Deployment controllers used for stateless Pods:
- Reliable replica identifiers. Each Pod in a StatefulSet is allocated a persistent identifier. The Pod will retain its identifier even if it’s replaced or rescheduled, ensuring the new Pod runs with the same characteristics.
- Stable storage access. Pods in a StatefulSet are individually assigned their own Persistent Volume claims. The Pod’s volume will be reattached after it’s rescheduled, providing stable storage access after a rollout or scaling operation.
- Rolling updates in a guaranteed order. StatefulSets support automated rolling updates that replace Pods one at a time in a fixed order. By default, the update starts at the Pod with the highest ordinal and works down to the lowest, so you can predict when each replica will be replaced.
- Consistent network identities. Pods in StatefulSets have reliable network identities. Their hostnames include their numerical replica identifier, allowing external applications to interact with the same replica after a Pod’s rescheduled.
StatefulSet vs. DaemonSet vs. Deployment
All three controllers create Pods from a template in your configuration, but they serve different purposes:
- StatefulSets are used for stateful applications, and they maintain a sticky identity for each of their pods.
- DaemonSets are used to keep a copy of a pod on every node inside the cluster, making them a great choice for node-level services.
- Deployments manage stateless applications, providing declarative updates to applications with capabilities for scaling, rolling updates, and rollbacks.
StatefulSets should be used when you’re deploying an application that requires stable identities for its Pods. Reach for a StatefulSet instead of a ReplicaSet or Deployment if your system will be disrupted when a specific Pod replica is replaced.
Replicated databases are a good example of the scenarios that StatefulSets accommodate. One Pod acts as the primary database node, handling both read and write operations, while additional Pods are deployed as read-only replicas.
Although each Pod may run the same container image, each one needs special configuration to set whether it’s in primary or read-only mode. This means your Pods possess their own state:
- postgres-0 – Primary node (read-write).
- postgres-1 – Read-only replica.
- postgres-2 – Read-only replica.
Regular ReplicaSets and Deployments aren’t suitable for this situation. Scaling down a Deployment removes arbitrary Pods, which could include the primary node in your database system. When you use a StatefulSet, Kubernetes terminates Pods in the opposite order to their creation. This ensures it’ll be postgres-2 that’s destroyed first.
Several other StatefulSet features also apply to this example:
- The applications that use your database need to reliably connect to the primary node, so they can both read and write data. The StatefulSet’s stable network identifiers ensure postgres-0.service.namespace.svc.cluster.local will always map to the primary Pod, even after scaling or replacing your Pods.
- The read-only replicas shouldn’t start until after the primary is up. StatefulSets create Pods in ordinal order by default, so each successive Pod is only created when the previous one is ready. This ensures there’s data available to replicate.
- Each replica has its own sticky volume for storage. The persistent data stored by each replica is bound to its Pod. The version of the database that postgres-1 has replicated needs to be maintained separately from the copy held by postgres-2. StatefulSets can handle this requirement.
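The naming scheme behind these stable network identities can be sketched in a few lines of shell. The function name `pod_fqdn` is hypothetical, the service name `postgres` and namespace `default` are example values, and the sketch assumes the default cluster domain `cluster.local`, which your cluster may override.

```shell
# Build the DNS name Kubernetes assigns to one StatefulSet Pod.
# Assumes the default cluster domain (cluster.local).
pod_fqdn() {
  statefulset=$1   # StatefulSet name, e.g. postgres
  ordinal=$2       # Pod index, starting at 0
  service=$3       # name of the governing headless service
  namespace=$4     # namespace the StatefulSet lives in
  printf '%s-%s.%s.%s.svc.cluster.local\n' \
    "$statefulset" "$ordinal" "$service" "$namespace"
}

# Example: the primary Pod, assuming a service called postgres
# in the default namespace
pod_fqdn postgres 0 postgres default
```

Because the ordinal is part of the name, a client that connects to ordinal 0 keeps reaching the same logical replica after the Pod is rescheduled.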
Ready to put this example into practice? Here’s how to run three replicas of PostgreSQL in Kubernetes using a StatefulSet.
Creating a StatefulSet
First, create a headless service for your deployment. A headless service is a service that defines a port binding but has its clusterIP set to None. StatefulSets require you to create a headless service to control their network identities.
Copy the following YAML and save it as postgres-service.yaml in your working directory:
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
    - name: postgres
      port: 5432
  clusterIP: None
  selector:
    app: postgres
Use kubectl to add the service to your cluster:
$ kubectl apply -f postgres-service.yaml
service/postgres created
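If you want to confirm the service really is headless, one option is to read back its `clusterIP` field. This is a minimal sketch that assumes `kubectl` points at the cluster you just applied the manifest to; `is_headless` is a hypothetical helper name.

```shell
# Succeeds when the named service has clusterIP set to None.
is_headless() {
  cluster_ip=$(kubectl get service "$1" -o jsonpath='{.spec.clusterIP}') || return 1
  [ "$cluster_ip" = "None" ]
}

# Usage (needs access to a cluster):
#   is_headless postgres && echo "postgres is headless"
```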
Next, copy the following YAML to postgres-statefulset.yaml. It defines a StatefulSet that runs three replicas of the postgres:latest image.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  selector:
    matchLabels:
      app: postgres
  serviceName: postgres
  replicas: 3
  template:
    metadata:
      labels:
        app: postgres
    spec:
      initContainers:
        - name: postgres-init
          image: postgres:latest
          command:
            - bash
            - "-c"
            - |
              set -ex
              [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
              ordinal=${BASH_REMATCH[1]}
              if [[ $ordinal -eq 0 ]]; then
                printf "I am the primary"
              else
                printf "I am a read-only replica"
              fi
      containers:
        - name: postgres
          image: postgres:latest
          env:
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              value: postgres
            - name: POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
          ports:
            - name: postgres
              containerPort: 5432
          livenessProbe:
            exec:
              command:
                - "sh"
                - "-c"
                - "pg_isready --host $POD_IP"
            initialDelaySeconds: 30
            periodSeconds: 5
            timeoutSeconds: 5
          readinessProbe:
            exec:
              command:
                - "sh"
                - "-c"
                - "pg_isready --host $POD_IP"
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 1
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
Apply the manifest to your cluster to create your StatefulSet:
$ kubectl apply -f postgres-statefulset.yaml
statefulset.apps/postgres created
Now you can list the Pods running in your cluster. The names of the three Pods from your StatefulSet will be suffixed with the sequential index they’ve been assigned:
$ kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   0          74s
postgres-1   1/1     Running   0          63s
postgres-2   1/1     Running   0          51s
The StatefulSet creates each Pod in order, once the previous one is Running and Ready. This ensures a replica doesn’t start until the Pod before it is ready to synchronize data. If a ReplicaSet had been used, all three Pods would have been created at the same time.
The StatefulSet uses init containers to determine whether new Pods are the Postgres primary or a replica. Each init container inspects its numeric index assigned by the StatefulSet controller; if it’s 0, the Pod is the first in the StatefulSet, so it becomes the primary database node.
Otherwise, it’s a replica:
$ kubectl logs postgres-0 -c postgres-init
I am the primary
$ kubectl logs postgres-1 -c postgres-init
I am a read-only replica
This demonstrates how StatefulSets let you consistently designate Pods as having a specific role. In a real-life Postgres example, you’d use your init containers to set up database replication from the primary Pod to the replicas. When the Pod’s index is 0, it should be configured as the primary; when it’s a higher number, the Pod is a replica that needs to synchronize the existing data and run in read-only mode.
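As a rough illustration of that decision, the sketch below derives a role from a Pod hostname using only POSIX parameter expansion, so it does not depend on bash. The function name `role_for_host` and the role labels are assumptions for this example, and it stops short of actually configuring replication.

```shell
# Print the role implied by a StatefulSet Pod hostname such as postgres-2.
role_for_host() {
  ordinal=${1##*-}             # keep everything after the last hyphen
  case $ordinal in
    ''|*[!0-9]*) return 1 ;;   # no numeric suffix: not a StatefulSet Pod name
    0) echo primary ;;
    *) echo replica ;;
  esac
}

# Inside a Pod you would call: role_for_host "$(hostname)"
role_for_host postgres-0
role_for_host postgres-2
```

A real init container would branch on this value to either initialize the primary or clone data from it.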
Each Pod in the StatefulSet gets its own Persistent Volume and Persistent Volume Claim. These are created using the manifest template defined in the StatefulSet’s volumeClaimTemplates field.
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS   REASON   AGE
pvc-6b48180c-0728-4666-aea9-12e0960f732e   1Gi        RWO            Delete           Bound    postgres-sts/data-postgres-0   standard                10m
pvc-83fc4a44-4927-454e-83e8-2c2f4c80af07   1Gi        RWO            Delete           Bound    postgres-sts/data-postgres-1   standard                10m
pvc-d7496cf0-97d2-405b-bf95-b28bf9bcedec   1Gi        RWO            Delete           Bound    postgres-sts/data-postgres-2   standard                10m
$ kubectl get pvc
NAME              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-postgres-0   Bound    pvc-6b48180c-0728-4666-aea9-12e0960f732e   1Gi        RWO            standard       10m
data-postgres-1   Bound    pvc-83fc4a44-4927-454e-83e8-2c2f4c80af07   1Gi        RWO            standard       10m
data-postgres-2   Bound    pvc-d7496cf0-97d2-405b-bf95-b28bf9bcedec   1Gi        RWO            standard       10m
This allows the Pods to manage their own state, independently of the others in the StatefulSet.
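The claim names in that output follow a predictable pattern: the volume claim template name, then the StatefulSet name, then the Pod ordinal. A small sketch, using a hypothetical helper called `expected_pvcs`, lists the claims you should expect for a given replica count:

```shell
# List the PVC names a StatefulSet creates: <template>-<statefulset>-<ordinal>.
expected_pvcs() {
  template=$1; statefulset=$2; replicas=$3
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "$template-$statefulset-$i"
    i=$((i + 1))
  done
}

# Prints data-postgres-0, data-postgres-1 and data-postgres-2, one per line
expected_pvcs data postgres 3
```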
Scaling a StatefulSet
Finally, let’s try scaling this StatefulSet. Running the following command will add two new replicas without affecting the existing Pods:
$ kubectl scale sts postgres --replicas 5
statefulset.apps/postgres scaled
Now there are five Pods, with the new ones created sequentially, one after the other:
$ kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   0          73m
postgres-1   1/1     Running   0          73m
postgres-2   1/1     Running   0          73m
postgres-3   1/1     Running   0          21s
postgres-4   0/1     Running   0          8s
When you scale down the StatefulSet, Kubernetes terminates Pods in the reverse order of their creation:
$ kubectl scale sts postgres --replicas 2
statefulset.apps/postgres scaled
$ kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   0          75m
postgres-1   1/1     Running   0          75m
Kubernetes has scaled down from five to two replicas by removing the three newest Pods.
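To make the ordering concrete, this sketch prints the Pods a scale-down would remove, highest ordinal first. It only models the naming and ordering described above and does not talk to a cluster; `scale_down_order` is a hypothetical helper.

```shell
# Print the Pods removed when scaling from $2 replicas down to $3, in order.
scale_down_order() {
  statefulset=$1
  i=$(( $2 - 1 ))     # highest ordinal currently running
  while [ "$i" -ge "$3" ]; do
    echo "$statefulset-$i"
    i=$((i - 1))
  done
}

# Prints postgres-4, postgres-3 and postgres-2, one per line
scale_down_order postgres 5 2
```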
Debugging a StatefulSet is much like debugging any other Kubernetes resource. You will need to:
- kubectl get statefulsets – to get an overview of what is happening with your StatefulSet
- kubectl describe statefulset statefulset_name – to get more information about your StatefulSet
- kubectl logs pod_name – to get even more information about the pods inside of your StatefulSet
- review liveness and readiness probes
In addition to this, because StatefulSets use persistent storage, you should also ensure that Persistent Volume Claims (PVC) are correctly bound to your Persistent Volumes (PV).
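Both the readiness check and the PVC check are easy to script against `kubectl` output. The filters below are a sketch: they assume the default column layout of `kubectl get pods` and `kubectl get pvc` when run with `--no-headers`, and the function names are made up for this example.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print
# Pods whose READY column (e.g. 0/1) shows unready containers.
not_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Read `kubectl get pvc --no-headers` output on stdin and print
# claims whose STATUS column is anything other than Bound.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Usage (needs access to a cluster):
#   kubectl get pods --no-headers | not_ready_pods
#   kubectl get pvc --no-headers | unbound_pvcs
```

Empty output from both filters suggests every Pod is ready and every claim is bound.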
To delete a StatefulSet in Kubernetes, use the kubectl delete statefulset <name> command. To prevent data loss, this doesn’t delete the volumes associated with the StatefulSet. To remove the PVCs, and consequently the volumes, you must delete them separately using the kubectl delete pvc <name> command.
You will also need to ensure that data is backed up or no longer needed before deleting PVCs, as this action is irreversible and results in data loss.
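A cautious teardown might look like the sketch below. It scales to zero before deleting so that Pods stop one at a time rather than all at once, and it deliberately leaves the PVCs in place so you can back them up or delete them yourself. It assumes the Pods carry an `app=<name>` label, as in the manifest earlier in this article, and the function name `teardown_statefulset` is hypothetical.

```shell
# Remove a StatefulSet but keep its PVCs.
# Assumes Pods are labelled app=<statefulset name>.
teardown_statefulset() {
  name=$1
  # Scale to zero first so Pods terminate in order, highest ordinal first
  kubectl scale statefulset "$name" --replicas=0 || return 1
  # Poll until no Pods with the label remain
  while [ -n "$(kubectl get pods -l "app=$name" --no-headers 2>/dev/null)" ]; do
    sleep 2
  done
  kubectl delete statefulset "$name"
}

# Usage (needs access to a cluster):
#   teardown_statefulset postgres
#   kubectl get pvc    # the claims should still be listed
```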
Although StatefulSets make it much easier to run stateful workloads in Kubernetes, they do come with some common “gotchas” that can trip you up. These are well-documented but still surprising when you encounter them for the first time.
- There’s no built-in way to resize volumes. StatefulSets don’t let you resize your volumes after initial creation. This can quickly become problematic as your service scales. There is a workaround but it’s clunky: you have to delete the StatefulSet and orphan its Pods, then recreate it with the updated volume template. The StatefulSet controller should detect your orphaned Pods and reattach them to the new object.
- Volumes aren’t deleted by default. StatefulSets don’t delete the volumes allocated to your Pods by default, even if a Pod is terminated because the StatefulSet’s removed or scaled down. This minimizes the risk of data loss but can cause you to accumulate many old volumes from redundant StatefulSets. Since Kubernetes v1.23, you can opt in to automatically deleting volumes by enabling the StatefulSetAutoDeletePVC API server feature gate and setting the persistentVolumeClaimRetentionPolicy field on your StatefulSets.
- Deleting a StatefulSet doesn’t guarantee that Pods will terminate in order. Scaling down a StatefulSet reliably terminates Pods in the reverse order of their creation. This behavior doesn’t occur when a StatefulSet’s deleted, though. Kubernetes will revert to deleting Pods simultaneously, which might prevent your application from cleanly exiting. You can avoid this problem by scaling your StatefulSets down to 0 before you delete them.
- You have to manually create headless services to benefit from reliable network identifiers. The StatefulSet controller won’t create services for you. Forgetting to define one will leave you without the predictable network identity features.
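For the volume-retention gotcha, the policy can be set in the manifest or patched onto an existing StatefulSet. The sketch below takes the patch route. It assumes your cluster supports `persistentVolumeClaimRetentionPolicy`, and `auto_delete_pvcs` is a hypothetical wrapper name.

```shell
# Ask Kubernetes to delete PVCs when the StatefulSet itself is deleted,
# while still retaining them when it is only scaled down.
auto_delete_pvcs() {
  kubectl patch statefulset "$1" --type merge -p \
    '{"spec":{"persistentVolumeClaimRetentionPolicy":{"whenDeleted":"Delete","whenScaled":"Retain"}}}'
}

# Usage (needs access to a cluster):
#   auto_delete_pvcs postgres
```

Keeping `whenScaled` at `Retain` means a temporary scale-down does not discard replica data.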
Keeping these sticking points in mind will help ensure your StatefulSet deployments perform as you expect.
StatefulSets are Kubernetes objects for running stateful applications in your cluster. They provide stable Pod identifiers, sticky storage, and ordered scaling and rolling updates that let you predict which replicas an operation will affect.
Use a StatefulSet each time you deploy a service with non-interchangeable Pods. Without StatefulSets, critical replicas could be replaced after you scale your deployment or Pods get rescheduled.
And check out how Spacelift brings the benefits of CI/CD to infrastructure management. You can use Spacelift to deploy changes to Kubernetes with GitOps while benefiting from robust security policies and automated compliance checks. Spacelift also works with other infrastructure as code (IaC) providers, so you can use similar techniques to manage every component of your stack.