Organizations using Terraform to manage their infrastructure as code (IaC) need a reliable solution to ensure their infrastructure’s actual state aligns with its intended state.
Terraform stores information about the infrastructure it manages in state files, and any change to the infrastructure Terraform manages that Terraform has not triggered is called “drift.”
In this post, we will explore the reasons why drift happens, its associated risks, and the options available to remediate it.
Terraform drift refers to the situation where the actual state of infrastructure in an environment diverges from the state defined in Terraform configuration files. Drift can happen due to changes outside of Terraform workflow, such as manual modifications, automated external processes, or resource eviction.
| Cause | Description |
| --- | --- |
| Manual changes | During a severity-one incident, a DevOps engineer may make manual changes just to get systems up and running. Those changes then have to be made in the code afterward. If that step is forgotten, the configuration drifts. |
| External processes | Automated processes outside Terraform's control, such as autoscaling actions triggered by cloud providers or external scripts, can change your infrastructure. |
| Resource eviction | Resources can be evicted or deleted because of cost-saving measures or policy violations, which causes drift. |
Drift is a significant concern and can lead to inconsistencies that complicate infrastructure management.
Consistency is a key goal when managing infrastructure using Terraform. With IaC, you can keep multiple environments consistent, irrespective of how many times they are recreated.
Infrastructure drift undermines that consistency. Here are some of its common sources:
Manual changes
Manual changes are a primary cause of infrastructure drift. These can be made either deliberately or unintentionally.
If a deployed system's configuration needs to be changed to address a critical production incident, doing it manually can be the fastest way to fix it. Similarly, network configurations are sometimes tweaked by hand to test a fix for a security vulnerability. These are examples of intentional manual changes to the infrastructure.
However, sometimes users are not even aware they have made a manual change to the infrastructure. Identifying the components managed by Terraform is not always intuitive. When users log into the web console, they may perform tasks on resources that Terraform's state file never records. Executing scripts that make API calls to the cloud platform is another possible source of unintentional change.
Irrespective of whether the change is deliberate or unintentional, if it is not ported back into the Terraform configuration, the result is drift.
Automation tools
Organizations and large teams implement multiple automation tools to streamline operations. These tools all have specific workflows and lifecycle management capabilities, and responsibilities can overlap if boundaries of influence for these tools are unclear or wrongly implemented.
For example, using Terraform for infrastructure management alongside a configuration management tool like Ansible creates a high possibility of infrastructure drift. Although Ansible is responsible for managing the application layer of a business service, it also has infrastructure provisioning capabilities.
Ironically, the more automation tools you implement, the more manual effort is required to reconcile the changes they create in Terraform state files.
User scripts
Cloud platforms facilitate the triggering of certain event-driven user-defined scripts. These scripts allow users to perform actions on a resource or execute API calls to modify another resource.
For example, when creating Linux-based EC2 instances in AWS, it is possible to execute bash/shell scripts when the instance boots. These scripts are provided in the user_data field when creating an instance from the web console. Similarly, Terraform provides a way to supply the same using IaC.
Although providing user_data is not mandatory, it enables various automation capabilities to manage virtual machines. User_data scripts are used to run upgrades, install security patches, install dependencies, invoke various system processes, etc., as soon as the system boots.
Bash and shell scripts are powerful because they can change any network configuration of the system and execute API calls to modify other resources. This has the potential to introduce drift in infrastructure.
Terraform IaC manages the infrastructure’s end-to-end lifecycle. It is responsible for creating and recreating cloud resources and consistently introducing changes. To do this successfully, up-to-date information is saved in the state files.
Essentially, infrastructure drift consists of untracked changes. These untracked changes pose risks of varying severity and could have a drastic impact on the system. Conversely, some changes may be beneficial for the system, improving attributes like reliability, security, and performance.
Given the nature of infrastructure drift in the context of Terraform IaC, if it’s not addressed, it can create blind spots while managing the infrastructure in scope. Infrastructure changes that fall out of the scope of Terraform management go unnoticed.
Security vulnerabilities
Infrastructure drift can expose security vulnerabilities in the system to attackers. This has the potential to cause serious damage not just to the system but to the business in general. For example, when security group rules are manually modified to test a certain use case for public access and are never reverted, the impact can range from data breaches to the entire system being compromised.
Compliance violations
Automated policy execution or manual configuration changes can lead to breaches of regulatory requirements — for example, drift that results in the personal data of users being exposed to the public or actions that enable unauthorized access to data and resources.
Performance and operational difficulties
Infrastructure drift can impair system performance because of latency or reduced network throughput, underprovisioning of resources, disabling of auto-scaling configurations, etc. Drift also makes it challenging to identify, analyze, and investigate the root cause of the issues. Unknown and untracked changes introduce challenges that increase downtime and also impact the mean time to resolution.
Higher costs
Changes caused by infrastructure drift can have wide-ranging financial implications. Provisioning of unutilized cloud resources generates unnecessary cloud platform costs, and, because the changes are not tracked, the cost of remediation and maintenance also increases.
You can learn more about drift in this article: Infrastructure Drift Detection and How to Fix It With IaC Tools.
As a simple example, suppose you have a Terraform configuration that creates three EC2 instances. After the configuration is applied, someone goes into the AWS console and deletes one of the instances manually.
This causes drift because the actual state of your infrastructure no longer matches your Terraform state or your Terraform configuration.
To solve the drift, you have two options:
- Reapply the Terraform code to recreate the missing instance.
- Change the Terraform configuration to reflect the current state of your infrastructure.
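The example above can be pictured as a comparison between two sets: what the configuration declares and what actually exists. The following is only an illustrative sketch; the instance names and the `find_drift` helper are hypothetical, and Terraform performs this comparison internally through its providers rather than through anything like this function.

```python
def find_drift(declared: set[str], actual: set[str]) -> dict[str, set[str]]:
    """Compare declared resources with the resources that actually exist."""
    return {
        # Declared in configuration but gone from the platform (e.g. deleted by hand).
        "missing": declared - actual,
        # Present on the platform but not declared in configuration.
        "unmanaged": actual - declared,
    }


# Hypothetical names standing in for the three EC2 instances in the example.
declared = {"web-1", "web-2", "web-3"}
actual = {"web-1", "web-3"}  # "web-2" was deleted in the console

report = find_drift(declared, actual)
print(report["missing"])  # {'web-2'}
```

Either remediation option closes the gap: reapplying recreates `web-2`, while changing the configuration removes it from the declared set.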
When infrastructure drift occurs, the first challenge is to identify it. As we have seen, drift has multiple sources, so it is not possible to track where and when the drift happens without a monitoring mechanism.
You can identify the existence of drift by running a couple of Terraform commands. The terraform refresh command updates the state file to match the real infrastructure (recent Terraform versions deprecate it in favor of terraform plan -refresh-only and terraform apply -refresh-only), and the terraform plan command produces a plan of action by comparing the state file with the current configuration. The output of the plan command helps us identify drift.
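For scripted checks, `terraform plan` accepts a `-detailed-exitcode` flag that returns 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. A minimal sketch of a wrapper follows; the function names and status strings are assumptions made for illustration, and `check_for_drift` assumes the `terraform` binary is on the PATH and the working directory has already been initialized.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANINGS = {0: "in_sync", 1: "error", 2: "changes_detected"}


def interpret_plan_exit_code(code: int) -> str:
    """Translate a plan exit code into a status label (labels are hypothetical)."""
    return PLAN_EXIT_MEANINGS.get(code, "unknown")


def check_for_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report its status without applying anything."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

If the configuration has not been edited since the last apply, a `changes_detected` result suggests that something outside Terraform modified the infrastructure.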
If the Terraform configuration has not changed and a plan still proposes modifying or recreating a resource, something else has modified the infrastructure. However, this approach only detects drift at the moment the commands are run. In practice, drift is often discovered only when we check the plan before implementing other intended changes.
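To see which resources a plan wants to touch, the plan can be saved with `terraform plan -out=<file>` and rendered as JSON with `terraform show -json <file>`. The sketch below reads the `resource_changes` list from that JSON and keeps entries whose actions are something other than `no-op` or `read`. The sample document is hand-written for illustration, with hypothetical resource addresses, and contains only the fields this function uses.

```python
import json


def changed_resources(plan_json: str) -> list[tuple[str, list[str]]]:
    """Return (address, actions) for every resource the plan would change."""
    plan = json.loads(plan_json)
    changes = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" means the resource is in sync; "read" applies to data sources.
        if actions and actions not in (["no-op"], ["read"]):
            changes.append((rc["address"], actions))
    return changes


# Hand-written sample, not real Terraform output.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web[0]", "change": {"actions": ["no-op"]}},
        {"address": "aws_instance.web[1]", "change": {"actions": ["create"]}},
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    ]
})

for address, actions in changed_resources(sample_plan):
    print(address, actions)
```

With an unchanged configuration, each address printed here is a candidate for investigation as drift.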
Periodic monitoring of IaC-managed infrastructure to proactively check for drift is challenging. Drift detection provided by Spacelift helps to identify and highlight infrastructure drift promptly. Configuring a drift monitor is as simple as configuring a cron job.
To start, select the stack you wish to configure drift detection for and navigate to Settings > Scheduling. A couple of notable control options are provided here:
- Reconcile: When this is enabled, Spacelift automatically remediates the drift. When infrastructure drift is identified, Spacelift triggers the “terraform apply” workflow to restore the original state of infrastructure as per the Terraform configuration.
- Schedule: This is a simple cron job notation that determines the scanning frequency and compares the state of deployment. In the example below, the drift detection happens every 15 minutes.
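In standard cron notation, a 15-minute interval is usually written as `*/15 * * * *`, where the first field selects every minute divisible by 15. The sketch below only illustrates which minutes that field selects; the helper name is hypothetical, and it does not parse full cron expressions or reflect how Spacelift evaluates schedules internally.

```python
from datetime import datetime, timedelta


def runs_in_window(start: datetime, minutes: int, step: int = 15) -> list[datetime]:
    """List the run times selected by the cron minute field `*/step`.

    `start` is truncated to the whole minute, and the window covers the
    next `minutes` minutes from that point.
    """
    first = start.replace(second=0, microsecond=0)
    candidates = (first + timedelta(minutes=offset) for offset in range(minutes))
    return [moment for moment in candidates if moment.minute % step == 0]


runs = runs_in_window(datetime(2024, 1, 1, 9, 0), 60)
print([r.strftime("%H:%M") for r in runs])  # ['09:00', '09:15', '09:30', '09:45']
```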
When drift is detected, it is represented in a very intuitive way, making it easy to interpret its impact.
The screenshot below shows that one of the network components has drifted.
Clicking on the drifted block quickly reveals details of the drift.
Read more in the documentation.
Whenever drift is detected, it is important to identify the factors that caused it. As discussed previously, the changes introduced out of the scope of Terraform could be either desirable or unwanted. Here are some examples:
- Changes caused by overlapping scopes of automation tools are usually desirable, but they signal that responsibility between the tools is not clearly defined.
- Changes introduced manually to troubleshoot a related issue elsewhere but overlooked and not reverted are unwanted because they expose the system to various vulnerabilities and could have cost implications.
- Hotfixes implemented to address issues in critical services can be either permanent fixes or temporary workarounds, which makes it difficult to classify them as either desirable or unwanted. Further investigation is needed to decide.
As seen from the examples above, understanding drift needs some analysis. The course of remediation action usually boils down to the following:
- If the changes are desirable, import the configuration under Terraform management scope.
- If the changes are not desired, reinstate the original state by running “terraform apply”.
- If a resource is not supposed to be managed by Terraform, disassociate it from Terraform state and configuration.
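The three remediation paths above map onto existing Terraform CLI commands: `terraform import`, `terraform apply`, and `terraform state rm`. The sketch below is one hedged way to express that mapping. The `DriftVerdict` enum and `remediation_command` function are hypothetical names, and the function only builds the argument list without running anything. Note that `terraform import` applies to resources that are not yet in state; when a desirable change affects an attribute of an already-managed resource, updating the configuration to match is typically enough.

```python
from enum import Enum
from typing import Optional


class DriftVerdict(Enum):
    DESIRABLE = "desirable"        # keep the change, bring it under Terraform
    UNWANTED = "unwanted"          # revert to what the configuration declares
    OUT_OF_SCOPE = "out_of_scope"  # Terraform should stop managing the resource


def remediation_command(
    verdict: DriftVerdict, address: str, resource_id: Optional[str] = None
) -> list[str]:
    """Build the CLI command matching a verdict (nothing is executed here)."""
    if verdict is DriftVerdict.DESIRABLE:
        if resource_id is None:
            raise ValueError("importing requires the provider-side resource ID")
        return ["terraform", "import", address, resource_id]
    if verdict is DriftVerdict.UNWANTED:
        return ["terraform", "apply"]
    # Removing from state leaves the real resource untouched; the matching
    # block must also be removed from the configuration by hand.
    return ["terraform", "state", "rm", address]


# Hypothetical resource address used purely for illustration.
print(remediation_command(DriftVerdict.OUT_OF_SCOPE, "aws_instance.legacy"))
```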
When drift detection is enabled, Spacelift highlights the drift in the very next drift detection run, so how quickly it surfaces depends on how frequently those runs are scheduled. When we enable the “Reconcile” option in the drift detection schedule, Spacelift automatically triggers Terraform runs to reinstate the original configuration. This is appropriate when the scope is clearly defined and resource management policies are in place.
However, if the boundaries are not clearly defined, you should turn off the “Reconcile” option. This is because there may be a need to either import drift or disassociate infrastructure from the current Terraform configuration. The drift detection schedule again plays an important role in confirming mitigation actions post-import/disassociation.
Several tools can help you identify drift and some of them can even remediate the drift for you. Below is a list of these tools:
- Terraform drift detection documentation
- Brainboard
- Terratest
- Driftctl
- TestInfra
- Kitchen-Terraform
Terraform drift detection documentation
The Terraform drift detection documentation offers a comprehensive guide to identifying and managing drift within your Terraform configurations. It outlines how to use some of Terraform’s native features, such as plan and apply, to detect changes not reflected in your Terraform configuration.
Brainboard
Brainboard is a cloud architecture design tool that offers several features to manage and visualize your cloud infrastructure effectively. It can help you identify discrepancies between your deployed resources and your Terraform configurations, thus making it easier for engineers to address drift and enforce compliance in their IaC definitions.
Terratest
Terratest is a Go library for testing infrastructure code. It can be used to automate testing of infrastructure states, indirectly helping to identify drift by validating the deployed resources against the Terraform configuration.
Driftctl
Driftctl is a dedicated tool for detecting drift that scans your Terraform state and compares it with the actual state of your resources. This approach helps to quickly identify and address drift, ensuring your infrastructure aligns with the IaC definition.
TestInfra
TestInfra is another testing framework for your infrastructure and even though it is not dedicated to Terraform, it can be used to test the state of the infrastructure managed by Terraform. It helps in identifying configuration drifts by asserting the actual state of your infrastructure against expected configurations.
Kitchen-Terraform
Kitchen-Terraform integrates the Test Kitchen automation tool with Terraform, allowing you to define tests for your Terraform configurations. Similar to Terratest and TestInfra, it can verify your configurations against the actual state of your infrastructure, thus detecting drift.
Managing infrastructure drift is challenging because it may originate from any source. It is difficult to get absolute certainty of who changed what and when. The risk potential of such drift can range from low to critical, and the impact can affect the system’s security, cost, and reliability.
Terraform IaC and state files are the only reliable and predictable sources of information about the managed infrastructure. Spacelift’s drift detection encompasses monitoring and an intuitive UI to highlight the drift (and optionally automate the reconciliation). This makes it easy to identify what has changed and how to proceed with investigating it.
Note: New versions of Terraform are placed under the BUSL license, but everything created before version 1.5.x stays open-source. OpenTofu is an open-source version of Terraform that expands on Terraform’s existing concepts and offerings. It is a viable alternative to HashiCorp’s Terraform, being forked from Terraform version 1.5.6.
Detect and Remediate Drift with Spacelift
Drift happens, so let Spacelift deal with it. Spacelift provides drift detection capabilities to any IaC provider to enable the desired state for application infrastructure across teams, applications, and clouds.