Infrastructure Drift Detection and How to Fix It With IaC Tools

So your team learned about Infrastructure as Code (IaC)? They got all excited about this way of managing infrastructure. Soon enough, you had multiple stacks described in definition files. This source code was versioned on your favorite version control platform, such as GitHub, and integrated with your favorite CI/CD platform.

Finally, the infrastructure was built from those definition files, and it was time to celebrate the end of this journey, right? Or was it the end of the journey?

What the definition files describe is the desired state of the infrastructure: the way you want it to be set up. The responsibility of your IaC tool is to turn that desired state into reality. We call the current reality of your infrastructure the actual state. Right after you run your IaC tool, the two are identical, but unfortunately they might not stay in sync for long.

Any difference between the desired state and the actual state is called drift.

We will cover:

  1. What is infrastructure drift?
  2. Why does infrastructure drift happen?
  3. Why do we want to avoid drift?
  4. What is drift detection?
  5. How to detect drift?
  6. What to do when drift is detected?
  7. How to fix configuration drift?

What is infrastructure drift?

Infrastructure drift refers to the situation where the actual state of cloud infrastructure resources deviates from the desired state defined in the Infrastructure as Code (IaC) configuration files. It occurs when changes are made to the cloud resources manually or through other means outside of the IaC management process, resulting in a difference between the codified definition and the real-world deployment.
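To make the definition concrete, here is a minimal sketch that treats both states as plain dictionaries and reports every attribute whose values differ. The attribute names and values are hypothetical, and real IaC tools compare far richer resource models than this.

```python
from typing import Any


def compute_drift(
    desired: dict[str, Any], actual: dict[str, Any]
) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (desired_value, actual_value)} for every mismatch."""
    missing = object()  # sentinel for attributes absent on one side
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, missing)
        have = actual.get(key, missing)
        if want != have:
            # Report absent attributes as None for readability.
            drift[key] = (
                None if want is missing else want,
                None if have is missing else have,
            )
    return drift


# Hypothetical Auto Scaling Group attributes, for illustration only.
desired_state = {"min_size": 1, "max_size": 3, "instance_type": "t3.micro"}
actual_state = {"min_size": 1, "max_size": 5, "instance_type": "t3.micro"}

print(compute_drift(desired_state, actual_state))  # {'max_size': (3, 5)}
```

An empty result means the two states agree; anything else is drift.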

Why does infrastructure drift happen?

Infrastructure drift happens mainly for two reasons:

  • Manual changes
  • Overlapping IaC configurations

Manual changes

Most drift is caused by manual changes made by individuals.

Some of the reasons for manual changes are understandable. For example, during an incident, an engineer might need to increase the number of resources to handle an elevated load or make up for resources being down. During times of high stress, when a lot is at stake, manual mitigation changes are perfectly acceptable. The goal is to get to a better place as soon as possible, and the regular IaC process might take too long, especially when you need to respond to fast-changing conditions.

This becomes a problem if the changes are not reverted or backported to the IaC definition files after the fire has been put out.

There are also bad reasons for manual changes. Those cannot be justified, even temporarily, and often stem from poor education on best IaC practices, loose access permissions, and a lack of proper communication regarding the infrastructure management process.

Overlapping/conflicting IaC code

In some cases, humans are not directly at fault. Resources may end up being managed by multiple sets of IaC definition files. Applying some definition files might revert changes made by other definition files.

This can happen when the IaC practices evolve over time. For example, it is not uncommon to switch to a different IaC tool after a few years because the team realized that, in hindsight, they had not picked the best tool for their use cases.

Another reason for this is overlapping boundaries between stack definitions, which can happen easily when the infrastructure is extensive and managed by many different teams over a long period of time.

Why do we want to avoid drift?

IaC is all about improving the governance of your infrastructure by defining it as code, which allows you to leverage a wealth of practices and tools that have been available to developers for a long time, such as code versioning, code reviews, static analysis, and automated tests.

Letting drift occur undermines that effort and provides a false sense of governance.

How to avoid drift in the first place?

As we have seen earlier, the primary source of drift is manual changes. Those are often linked to loose access control practices. The principle of least privilege recommends granting only the necessary permissions. The fewer people who can manually modify the infrastructure, the better. Admin-level access is typically limited to senior infrastructure engineers and SREs.

Because you will always need to have some people with permission to perform manual changes to your infrastructure, you need to make sure that they are aware of the process to either revert or backport those changes in due time.

That being said, even with the most trained engineers and the best intentions, drift will happen. It is inevitable, so you need to make sure that you can easily and quickly detect it and possibly revert it.

To reduce drift, you first need a few things in place:

  • Use a VCS for your code – Your infrastructure code should always be kept in a version control system, which should be the only source of truth for your infrastructure. To be effective, you need a branching strategy, and you should only merge changes to the main branch (your source of truth) when all your checks pass.
  • Implement RBAC – Drift usually occurs when there are manual changes to your infrastructure. With RBAC, you can ensure that most engineers cannot update resources from the console. This may seem harsh, but applying least-privilege access and letting only a couple of engineers change the infrastructure manually reduces the chances of drift.
  • Take advantage of change management – Change management processes are key to reducing infrastructure drift. With a process in place that governs deployments to higher environments and is followed by all engineers, drift becomes much less likely.
  • Have a process for critical issues – Sometimes critical issues cause downtime, and solving them quickly is mandatory for the business. In these cases, changes to your infrastructure may be made manually to resolve the issue as fast as possible. It is key to have a process in place that brings engineers back to the issue afterwards and has them apply the fix in the infrastructure as code, configuration management, CI/CD, or container orchestration tools as well.

Even with all of these measures in place, some drift will still slip through, which is why detecting it matters.

What is drift detection?

Drift detection is a mechanism that helps you identify and manage discrepancies between the expected state of your infrastructure and the actual one. To achieve drift detection, you need a tool that constantly monitors, detects, and alerts you about these discrepancies. By detecting drift, you ensure your infrastructure is consistent and compliant.

Drift detection vs. drift management

Drift detection identifies discrepancies between your infrastructure and your IaC, while drift management refers to the overall process of detecting drift and deciding what actions to take when it is found.

How to detect drift?

As we said earlier, drift is the difference between the desired state of the infrastructure as defined in the IaC source code and the actual state of the infrastructure.

Put differently, drift shows up as a non-empty list of proposed changes when you run your IaC tool's plan command against definition files that have not changed.

Here is how to display the proposed changes and detect drift with different IaC tools.

Terraform drift detection

Run the terraform plan command to get the list of proposed changes.

In the screenshot below, we can see that the maximum number of servers in the Auto Scaling Group was set to 5 outside of Terraform, which is drift.

Auto Scaling Group outside Terraform
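For scripted checks, the plan command can be run with the -detailed-exitcode flag, which makes Terraform exit with 0 when no changes are proposed, 1 on error, and 2 when changes are proposed. The sketch below wraps that in Python. The function names and result labels are illustrative, and a result of "drifted" only means drift if the definition files themselves have not changed since the last apply.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANING = {
    0: "in_sync",  # plan succeeded, no changes proposed
    1: "error",    # plan failed
    2: "drifted",  # plan succeeded, changes proposed
}


def classify_plan_exit_code(code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` exit code."""
    return PLAN_EXIT_MEANING.get(code, "error")


def detect_terraform_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report in_sync, drifted, or error."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Calling detect_terraform_drift("path/to/stack") from an initialized working directory returns one of the three labels, which is easy to feed into an alert.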

CloudFormation drift detection

CloudFormation has a built-in drift detection feature that can be used either via the AWS Console or via the AWS CLI.

CloudFormation’s drift detection must be triggered manually. There is no built-in automation to make it run on a schedule. Also, not all resource types support drift detection at this time.

Drift detection checks can be run via the AWS Console:

AWS Console Overview
AWS Console Drift details

Or on the command line with the AWS CLI:

describe-stack-drift-detection-status
describe-stack-resource-drifts
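The CLI flow can be sketched in Python as follows: trigger detection with detect-stack-drift, poll describe-stack-drift-detection-status with the returned ID, then list the affected resources with describe-stack-resource-drifts. The helper names are illustrative, the code assumes the AWS CLI is installed and configured, and the polling loop is left out for brevity.

```python
import json
import subprocess


def aws_cloudformation(*args: str) -> dict:
    """Run an `aws cloudformation` subcommand and parse its JSON output."""
    result = subprocess.run(
        ["aws", "cloudformation", *args, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def start_drift_detection(stack_name: str) -> str:
    """Trigger detection and return the detection ID to poll."""
    response = aws_cloudformation("detect-stack-drift", "--stack-name", stack_name)
    return response["StackDriftDetectionId"]


def detection_status(detection_id: str) -> str:
    """Return the DetectionStatus field, e.g. DETECTION_COMPLETE."""
    response = aws_cloudformation(
        "describe-stack-drift-detection-status",
        "--stack-drift-detection-id",
        detection_id,
    )
    return response["DetectionStatus"]


def drifted_resources(response: dict) -> list[tuple[str, str]]:
    """Pick (logical ID, status) pairs for modified or deleted resources
    from a `describe-stack-resource-drifts` response."""
    return [
        (drift["LogicalResourceId"], drift["StackResourceDriftStatus"])
        for drift in response.get("StackResourceDrifts", [])
        if drift["StackResourceDriftStatus"] in ("MODIFIED", "DELETED")
    ]
```

Once detection has completed, drifted_resources(aws_cloudformation("describe-stack-resource-drifts", "--stack-name", name)) gives a compact list to report on.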

Pulumi drift detection

Run the pulumi preview --refresh --stack <STACK NAME> command to get the list of proposed changes.

The screenshot below shows that the tags and user data of the AWS EC2 instance have been modified manually.

pulumi preview --refresh --stack
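The same check can be scripted by adding the --json flag to the preview command and inspecting the result. The sketch below assumes the JSON output carries a changeSummary object that maps operation names (such as "same" or "update") to resource counts; treat that shape as an assumption and verify it against the output of your Pulumi version. The function names are illustrative.

```python
import json
import subprocess


def preview_command(stack: str) -> list[str]:
    """Build the preview command, refreshing state from the cloud first."""
    return ["pulumi", "preview", "--refresh", "--stack", stack, "--json"]


def proposed_changes(preview: dict) -> dict[str, int]:
    """Return non-zero change counts, ignoring resources that stay the same.

    Assumes a top-level "changeSummary" mapping of operation -> count.
    """
    summary = preview.get("changeSummary", {})
    return {op: count for op, count in summary.items() if op != "same" and count > 0}


def detect_pulumi_drift(stack: str, workdir: str) -> dict[str, int]:
    """Run the preview in `workdir` and return the proposed changes, if any."""
    result = subprocess.run(
        preview_command(stack),
        cwd=workdir,
        check=True,
        capture_output=True,
        text=True,
    )
    return proposed_changes(json.loads(result.stdout))
```

An empty dictionary means the preview proposed nothing; any other result lists how many resources would be created, updated, or deleted.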

Spacelift drift detection

Drift can occur at any time. As a result, drift detection must be run on a regular schedule to catch it as quickly as possible, which is not practical when running those commands on one’s laptop.

A better approach would be to use a tool such as Spacelift that can check for drift automatically on a schedule that you set.
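To illustrate what scheduled detection involves, here is a minimal, tool-agnostic sketch of a polling loop. It is not how Spacelift implements the feature. The check and notify callables are placeholders that you would supply, for example a function wrapping your IaC tool's plan command and a function posting to a chat channel.

```python
import time
from typing import Callable, Optional


def find_drifted(stacks: list[str], check: Callable[[str], bool]) -> list[str]:
    """Return the stacks for which `check` reports drift."""
    return [stack for stack in stacks if check(stack)]


def run_on_schedule(
    stacks: list[str],
    check: Callable[[str], bool],
    notify: Callable[[list[str]], None],
    interval_seconds: float,
    max_rounds: Optional[int] = None,
) -> None:
    """Check every stack, notify about drifted ones, then wait and repeat.

    With max_rounds=None the loop runs until the process is stopped.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        drifted = find_drifted(stacks, check)
        if drifted:
            notify(drifted)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_seconds)
```

A loop like this still has to run somewhere reliable, with credentials and alerting around it, which is the part a managed platform takes off your hands.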

The view that shows all the resources for a stack uses an eye-catching icon for resources that have drifted so that they can be easily spotted.

resources that have drifted

Another benefit of using Spacelift is that the drift detection management experience is consistent across the supported IaC tools. Under the hood, different commands will be run, but for the most part the workflow and the screens will be identical.

What to do when drift is detected?

You probably want to get rid of most drift, which we will explain in a bit, but some manual changes should make their way into the definition files.

For example, you had to increase the number of resources during an elevated load episode, but realistically, this is not a one-off but the new normal. Then, you should not eliminate the drift but update the IaC definition files to reflect your new expectations.

How to fix configuration drift?

Since drift shows up as a non-empty list of proposed changes when the definition files have not changed, fixing it means applying those proposed changes. That will restore the infrastructure to its desired state.

Here is how to remove the drift and get back to the desired infrastructure state with the main IaC tools.

Terraform drift remediation

Run the terraform apply command to revert the external changes and remove the drift.
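When remediation is scripted, one cautious pattern is to save the plan to a file and apply exactly that file, so that what gets applied is what was planned. The sketch below assumes that pattern. The plan file name drift.tfplan and the function names are illustrative, and in practice you would review the saved plan before the apply step.

```python
import subprocess


def plan_command(plan_file: str) -> list[str]:
    """Plan with machine-readable exit codes and save the result to a file."""
    return ["terraform", "plan", "-input=false", "-detailed-exitcode", f"-out={plan_file}"]


def apply_command(plan_file: str) -> list[str]:
    """Apply a previously saved plan; Terraform does not prompt in this mode."""
    return ["terraform", "apply", "-input=false", plan_file]


def remediate(workdir: str, plan_file: str = "drift.tfplan") -> bool:
    """Return True if changes were applied, False if there was nothing to do."""
    plan = subprocess.run(plan_command(plan_file), cwd=workdir)
    if plan.returncode == 0:  # no changes proposed
        return False
    if plan.returncode != 2:  # 1 means the plan itself failed
        raise RuntimeError("terraform plan failed")
    subprocess.run(apply_command(plan_file), cwd=workdir, check=True)
    return True
```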

Read more: Terraform Drift Detection and Remediation [Guide]

AWS CloudFormation drift remediation

CloudFormation can revert drift only in some cases.

For example, if a resource is missing, it will be recreated, but if a property of a resource was modified, it might not be detected by CloudFormation and, as a result, won’t be fixed automatically.

If CloudFormation cannot automatically fix the detected drift, you can use the information provided to manually revert the unexpected changes.

Pulumi drift remediation

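Run the pulumi up --refresh --stack <STACK NAME> command to revert the external changes and remove the drift. The --refresh flag makes Pulumi read the actual state of the resources first; without it, Pulumi compares your program against its last recorded state and may not see changes made outside of it.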

Spacelift drift remediation

When drift is detected, Spacelift can optionally revert the changes found by following the same workflow that is used for regular IaC code changes, enforcing all the configured guardrails such as automated validation of the plan and approval workflow.

Read more about drift detection with Spacelift.

If you want to take your infrastructure automation to the next level, create a Spacelift account today or book a demo with one of our engineers.

Key points

Like incidents, drift is inevitable and part of the life of any infrastructure. It must be taken into account when defining your processes and selecting your tools, so that you are not caught off guard and can stay on top of your infrastructure governance.