So your team learned about Infrastructure as Code (IaC)? They got all excited about this way of managing infrastructure. Soon enough, you had multiple stacks described in definition files. This source code was versioned in your favorite version control system, hosted on a platform such as GitHub, and integrated with your favorite CI/CD platform.
Finally, the infrastructure was built from those definition files, and it was time to celebrate the end of this journey, right? Or was it the end of the journey?
What the definition files describe is the desired state of the infrastructure: the way you want it to be set up. The responsibility of your IaC tool is to turn that desired state into reality. We call the current reality of your infrastructure the actual state. Right after you run your IaC tool, the two are identical. Unfortunately, they might not stay in sync for long.
Any difference between the desired state and the actual state is called drift.
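To make the definition concrete, here is a minimal sketch that treats both states as flat dictionaries of attributes and reports every attribute whose values differ. The attribute names and values are hypothetical and only serve as an illustration; real IaC tools compare far richer resource models.

```python
from typing import Any, Dict, Tuple


def compute_drift(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (desired_value, actual_value)} for every attribute that differs.

    An attribute missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical Auto Scaling Group attributes, for illustration only.
desired_state = {"min_size": 1, "max_size": 3, "instance_type": "t3.micro"}
actual_state = {"min_size": 1, "max_size": 5, "instance_type": "t3.micro"}

drift = compute_drift(desired_state, actual_state)
print(drift)  # {'max_size': (3, 5)}
```

An empty result means the two states are in sync; anything else is drift.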
It might be surprising at first that infrastructure drifts. After all, you are managing it via code, right? Unfortunately, there are different ways your infrastructure can be modified outside of your IaC process.
Most of the drift is usually caused by manual changes performed by individuals.
Some of the reasons behind those manual changes are understandable. For example, during an incident, an engineer might need to increase the number of resources to handle an elevated load or make up for resources being down. During those times of high stress when a lot is at stake, manual mitigation changes are perfectly acceptable. The goal is to get to a better place as soon as possible, and the regular IaC process might take too long, especially when you need to respond to fast-changing conditions.
This becomes a problem if, after the fire has been put out, the changes are not reverted or backported to the IaC definition files.
There are also bad reasons for manual changes. Those cannot be justified, even temporarily, and often stem from poor education on best IaC practices, loose access permissions, and a lack of proper communication regarding the infrastructure management process.
Overlapping/conflicting IaC code
In some cases, humans are not directly at fault. Resources may end up being managed by multiple sets of IaC definition files. Applying some definition files might revert changes made by other definition files.
This can happen when the IaC practices evolve over time. For example, it is not uncommon to switch to a different IaC tool after a few years because the team realized that, in hindsight, they had not picked the best tool for their use cases.
Another reason is overlapping boundaries between stack definitions, which can easily happen when the infrastructure is extensive and has been managed by many different teams over a long period of time.
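One way to spot this kind of overlap is to build an inventory of which stack manages which resource and look for resources that appear more than once. The sketch below assumes you can export such an inventory yourself; the stack names and resource identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def find_shared_resources(stacks: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return {resource_id: [stack names]} for resources managed by more than one stack."""
    owners = defaultdict(list)
    for stack_name, resource_ids in stacks.items():
        for resource_id in resource_ids:
            owners[resource_id].append(stack_name)
    return {rid: sorted(names) for rid, names in owners.items() if len(names) > 1}


# Hypothetical inventory: stack name -> identifiers of the resources it manages.
stacks = {
    "network": {"vpc-main", "sg-web"},
    "web-app": {"asg-web", "sg-web"},
    "legacy-tool-export": {"asg-web"},
}

overlaps = find_shared_resources(stacks)
print(overlaps)
```

Every entry in the result is a resource where applying one stack may undo what another stack applied.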
IaC is all about improving the governance of your infrastructure by defining it as code. This allows you to leverage a wealth of practices and tools that have long been available to developers, such as code versioning, code reviews, static analysis, and automated tests.
Letting drift occur undermines that effort and provides a false sense of governance.
As we have seen earlier, the primary source of drift is manual changes. Those are often linked to loose access control practices. The principle of least privilege recommends granting only the necessary permissions. The fewer people who can manually modify the infrastructure, the better. Admin-level access is typically limited to senior infrastructure engineers and SREs.
Because you will always need to have some people with the permissions to perform manual changes to your infrastructure, you need to make sure that they are aware of the process to either revert or backport those changes in due time.
That being said, even with the most trained engineers and the best intentions, drift will happen. It is inevitable, so you need to make sure that you can easily and quickly detect it and possibly revert it.
As we said earlier, drift is the difference between the desired state of the infrastructure as defined in the IaC source code and the actual state of the infrastructure.
Said differently, drift is getting a non-empty list of proposed changes when running the plan command for your IaC tool even though the definition files have not changed.
Here is how to display the proposed changes with the main IaC tools.
With Terraform, run the terraform plan command to get the list of proposed changes.
In the screenshot below, we can see that the maximum number of servers in the Auto Scaling Group was set to 5 outside of Terraform, which is drift.
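If you want a script rather than a human to read the plan, Terraform's -detailed-exitcode flag makes the result machine-readable: the plan exits with 0 when there are no changes, 1 on error, and 2 when changes are proposed. The sketch below wraps that in Python. The function names and the "in_sync"/"drifted" labels are my own, and a non-empty plan only means drift if the definition files have not changed since the last apply.

```python
import subprocess
from typing import List

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes proposed.
PLAN_COMMAND: List[str] = [
    "terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color",
]


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label."""
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drifted"
    return "error"


def check_terraform_drift(workdir: str) -> str:
    """Run the plan in an already initialized working directory and classify the result."""
    result = subprocess.run(PLAN_COMMAND, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit_code(result.returncode)
```

Calling check_terraform_drift("path/to/stack") requires the terraform binary and valid credentials, so the classification is kept in its own function that can be tested without either.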
CloudFormation has a built-in drift detection feature that can be used either via the AWS Console or via the AWS CLI.
CloudFormation's drift detection must be triggered manually. There is no built-in automation to make it run on a schedule. Also, drift detection is not supported for all resource types at this time.
In the AWS Console, drift detection is available from the stack actions menu. The same check can be started from the command line with the AWS CLI.
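On the command line, the check is a sequence of three AWS CLI calls: detect-stack-drift starts the detection and returns an ID, describe-stack-drift-detection-status reports on progress and the overall result, and describe-stack-resource-drifts lists the details per resource. Here is a minimal sketch that builds those commands and interprets the status response. The helper names, the stack name, and the sample response are hypothetical; the sample is hand-written to show the fields used, not captured from a real account.

```python
import json
from typing import List


def detect_drift_command(stack_name: str) -> List[str]:
    """Start a drift detection run; the JSON response contains StackDriftDetectionId."""
    return ["aws", "cloudformation", "detect-stack-drift", "--stack-name", stack_name]


def detection_status_command(detection_id: str) -> List[str]:
    """Poll the detection run started by detect-stack-drift."""
    return ["aws", "cloudformation", "describe-stack-drift-detection-status",
            "--stack-drift-detection-id", detection_id]


def resource_drifts_command(stack_name: str) -> List[str]:
    """List per-resource drift details once detection has completed."""
    return ["aws", "cloudformation", "describe-stack-resource-drifts",
            "--stack-name", stack_name]


def summarize_status(response_json: str) -> str:
    """Return the stack drift status, or the detection status while it is not complete."""
    response = json.loads(response_json)
    detection = response.get("DetectionStatus", "UNKNOWN")
    if detection != "DETECTION_COMPLETE":
        return detection
    return response.get("StackDriftStatus", "UNKNOWN")


# Hand-written sample response, for illustration only.
sample = json.dumps({
    "DetectionStatus": "DETECTION_COMPLETE",
    "StackDriftStatus": "DRIFTED",
    "DriftedStackResourceCount": 1,
})
print(" ".join(detect_drift_command("my-stack")))
print(summarize_status(sample))  # DRIFTED
```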
With Pulumi, run the pulumi preview --refresh --stack <STACK NAME> command to get the list of proposed changes.
The screenshot below shows that the tags and user data of the AWS EC2 instance have been modified manually.
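The same preview can be consumed by a script by adding the --json flag. The sketch below assumes that the JSON output contains a changeSummary object mapping operation names (such as same, update, or create) to resource counts; check this against the output of your Pulumi version before relying on it. The function names and the sample document are hypothetical.

```python
import json
from typing import Dict, List


def preview_command(stack_name: str) -> List[str]:
    """Build the preview command with machine-readable output."""
    return ["pulumi", "preview", "--refresh", "--json", "--stack", stack_name]


def count_pending_changes(preview_json: str) -> Dict[str, int]:
    """Return the non-zero operation counts other than 'same'.

    Assumes a top-level 'changeSummary' object of operation -> count.
    """
    summary = json.loads(preview_json).get("changeSummary", {})
    return {op: count for op, count in summary.items() if op != "same" and count > 0}


# Hand-written sample, for illustration only: one resource would be updated.
sample_preview = '{"changeSummary": {"same": 4, "update": 1}}'
pending = count_pending_changes(sample_preview)
print(pending)  # {'update': 1}
```

An empty dictionary means the preview proposes no changes; anything else is drift, provided the program itself has not changed.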
Drift can occur at any time. As a result, drift detection must be run on a regular schedule to catch it as quickly as possible, which is not practical when running those commands on one's laptop.
A better approach would be to use a tool such as Spacelift that can check for drift automatically on a schedule that you set.
The view that shows all the resources for a stack uses an eye-catching icon for resources that have drifted so that they can be easily spotted.
Another benefit of using Spacelift is that the drift detection management experience is consistent across the supported IaC tools. Under the hood, different commands will be run, but for the most part, the workflow and the screens will be identical.
You probably want to get rid of most drift, and we will explain how in a bit, but there might be manual changes that should make their way to the definition files.
For example, suppose you had to increase the number of resources during an elevated load episode, and realistically, this is not a one-off but the new normal. In that case, you should not eliminate the drift but update the IaC definition files to reflect your new expectations.
Since drift is a non-empty list of proposed changes when the definition files have not changed, fixing that drift means applying the proposed changes. That will restore the infrastructure to its desired state.
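The two ways out of drift can be sketched with the same dictionary model of state: backporting copies selected manual changes into the desired state, and reverting overwrites the actual state with the desired one. This is a simplified illustration with hypothetical attribute names, not how any particular IaC tool implements an apply.

```python
from typing import Any, Dict, Iterable

State = Dict[str, Any]


def backport(desired: State, actual: State, keys: Iterable[str]) -> State:
    """Return a new desired state that adopts the actual values of the selected attributes."""
    updated = dict(desired)
    for key in keys:
        if key in actual:
            updated[key] = actual[key]
        else:
            updated.pop(key, None)  # the attribute was removed manually; adopt the removal
    return updated


def revert(desired: State) -> State:
    """Return the actual state after applying the proposed changes: a copy of desired."""
    return dict(desired)


# Hypothetical states: max_size was raised during an incident, a debug tag was left behind.
desired = {"max_size": 3, "tags": {"team": "web"}}
actual = {"max_size": 5, "tags": {"team": "web", "debug": "true"}}

new_desired = backport(desired, actual, ["max_size"])  # keep the new capacity
new_actual = revert(new_desired)                        # drop the leftover debug tag
print(new_desired)
print(new_actual == new_desired)  # True
```

Here the higher capacity is kept as the new normal while the leftover tag is removed, after which both states match again.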
Here is how to remove the drift and get back to the desired infrastructure state with the main IaC tools.
With Terraform, run the terraform apply command to revert the external changes and remove the drift.
CloudFormation can revert drift only in some cases.
For example, if a resource is missing, it will be recreated. But if a property of a resource was modified, the change might not be detected by CloudFormation, and as a result, it will not be fixed automatically.
If CloudFormation cannot automatically fix the detected drift, you can use the information provided to manually revert the unexpected changes.
With Pulumi, run the pulumi up --stack <STACK NAME> command to revert the external changes and remove the drift.
When drift is detected, Spacelift can optionally revert the changes found by following the same workflow that is used for regular IaC code changes, enforcing all the configured guardrails such as automated validation of the plan and approval workflow.
Like incidents, drift is inevitable and part of the life of any infrastructure. It must be taken into account when defining the processes and selecting your tools so that you do not get caught off guard and stay on top of things regarding your infrastructure governance.