Managing infrastructure as code (IaC) is the standard today, as it offers significant advantages over traditional, manual approaches. Terraform is a cloud-agnostic tool for managing infrastructure in the form of code.
Terraform stores information about the infrastructure it manages in state files. The state files hold the mapping between real-world infrastructure and resource configurations. They play a crucial role in the Terraform IaC workflow, as they provide the reference for all operations performed on the infrastructure throughout its lifecycle.
Any change to Terraform-managed infrastructure that is not made through Terraform is known as drift, and the affected infrastructure is said to have drifted.
In this post, we will explore the reasons why drifts happen, understand the risks associated with them, and explore the options to remediate these drifts.
One of the goals of managing infrastructure using Terraform is consistency. IaC makes it possible to maintain the consistency of multiple environments irrespective of how many times they are recreated.
In this section, we discuss a few common sources of infrastructure drift.
Manual changes to the infrastructure are one of the primary causes of drift. Depending on the situation, manual changes may be made intentionally or unintentionally.
When there is a need to change the configuration of a deployed system to address a critical production incident, manual changes are one of the fastest ways to fix it. Similarly, to address a certain network security vulnerability, certain network configurations are tweaked for testing purposes. These are a couple of examples of intentional manual changes to the infrastructure.
Sometimes infrastructure is manually changed without awareness. Identifying the components managed by Terraform is not always intuitive. When users log into the web console, they may perform specific tasks on resources without knowing that those resources are tracked in Terraform’s state file. Executing scripts that make API calls to the cloud platform is also a possible source of unintentional change.
Irrespective of whether the change is intentional or unintentional, if it is not ported back into the Terraform configuration, the result is drift.
Organizations and large teams implement multiple automation tools to streamline operations. Unclear and wrongly implemented boundaries of influence for these tools cause an overlap of responsibilities.
For example, when Terraform is used for infrastructure management alongside a configuration management tool like Ansible, there is a high chance of infrastructure drift. Although Ansible is typically responsible for managing the application layer of a business service, it also has infrastructure provisioning capabilities.
Unless the purpose and scope of various automation tools are clearly defined, teams may use their overlapping capabilities to achieve their goals. Managing cloud resource configurations using multiple tools with specific workflows and lifecycle management capabilities causes infrastructure drifts.
Manual effort is required to reconcile changes made by another process with the Terraform state files. Ironically, implementing more automation tools leads to more manual reconciliation. We will discuss how to address drift later in the post.
Cloud platforms provide the ability to trigger certain event-driven user-defined scripts. These scripts provide flexibility to users to perform actions on the resource or execute API calls to modify another resource.
For example, when creating Linux-based EC2 instances in AWS, it is possible to execute bash/shell scripts when the instance boots. These scripts are provided in the user_data field when creating an instance from the web console. Similarly, Terraform provides a way to supply the same using IaC.
Even though it is not mandatory to provide user_data, it enables various automation capabilities to manage virtual machines. User_data scripts are used to run upgrades, install security patches, install dependencies, invoke various system processes, etc. as soon as the system boots.
Shell scripts are powerful: they can change the network configuration of the system as well as make API calls that modify other resources. This gives them a high potential to introduce drift in infrastructure.
Terraform IaC is used to manage the end-to-end lifecycle of the infrastructure. It is not just responsible for creating and recreating cloud resources but also for introducing changes consistently. To achieve the same, up-to-date information is saved in the state files.
Drifts in infrastructure are, in essence, untracked changes. Such untracked changes pose risks of varying severity and can have a drastic impact on the system. Conversely, some changes may prove to be beneficial, improving attributes like reliability, security, or performance.
If drifts are not addressed, they create blind spots in the management of the infrastructure in scope. Infrastructure changes that fall outside the scope of Terraform management go unnoticed.
Infrastructure drift can expose security vulnerabilities in the system to attackers. This has the potential to cause serious damage not just to the system but to the business in general. For example, security group rules may be manually modified to allow public access while testing a certain use case and then left in place. The impact can range from data breaches to the entire system being compromised.
Automated policy execution or manual configuration changes can lead to a failure to adhere to regulatory requirements. Examples include drift that exposes users’ personal data to the public, or actions that grant users unauthorized access to data and resources.
Performance and operational difficulties
Infrastructure drift can degrade system performance through increased latency, reduced network throughput, underprovisioned resources, disabled auto-scaling configurations, etc. Drift also makes it challenging to identify, analyze, and investigate the root cause of issues. Unknown and untracked changes increase downtime and lengthen the mean time to resolution.
The financial implications of drift vary widely depending on the nature of the change. Provisioning cloud resources that go unused results in unnecessary cloud platform costs. Since the changes are not tracked, remediation and maintenance become challenging, which further increases costs.
To learn more about drift, check out the article – Infrastructure Drift Detection and How to Fix It With IaC Tools.
When infrastructure drift happens, the first challenge is to notice it. As we have seen so far, there are multiple sources of drift. Thus it is not possible to track where and when drift happens unless we have some monitoring in place.
It is easy to check for drift by running a couple of Terraform commands. The terraform refresh command updates the state file to match the real infrastructure, and the terraform plan command produces a plan of action by comparing the state file with the current configuration. The output of the plan command helps us identify drift. Note that recent Terraform versions deprecate terraform refresh in favor of the -refresh-only option of terraform plan and terraform apply.
Without changing the Terraform config, if the execution of a plan command suggests either modifying or recreating a certain resource – then this indicates that something else has modified the infrastructure. But this depends on the moment when we choose to run these commands. It often happens when we prepare and check the status before implementing other intended changes.
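The manual check described above can be scripted. The following is a minimal sketch, assuming the plan was produced with `terraform plan -detailed-exitcode -out=plan.bin` against an unchanged configuration and exported with `terraform show -json plan.bin`. The helper names and the sample plan are hypothetical; only the exit codes and the `resource_changes` / `change.actions` fields come from Terraform itself.

```python
import json

# Exit codes of `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {
    0: "no changes: infrastructure matches the configuration",
    1: "error: the plan could not be computed",
    2: "changes present: possible drift if the configuration was not edited",
}


def classify_exit_code(code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` exit code."""
    return EXIT_MEANINGS.get(code, "unknown exit code")


def changed_addresses(plan_json: str) -> list[str]:
    """Return addresses of resources the plan wants to create, update, or delete.

    With an unchanged configuration, any address returned here points at
    something that was modified outside Terraform.
    """
    plan = json.loads(plan_json)
    changed = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if any(action not in ("no-op", "read") for action in actions):
            changed.append(rc["address"])
    return changed


# Hand-written sample standing in for real `terraform show -json` output.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.app", "change": {"actions": ["no-op"]}},
    ]
})

if __name__ == "__main__":
    print(classify_exit_code(2))
    print(changed_addresses(SAMPLE_PLAN))
```

Such a script only reports drift at the moment it is run, so it would still need to be scheduled to be useful as a monitor.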
Periodic monitoring of the IaC-managed infrastructure to proactively check for drifts is a challenge. Drift detection provided by Spacelift helps with identifying and highlighting infrastructure drifts on time. Configuring a drift monitor is as simple as configuring a cron job.
As shown in the screenshot below, select the stack we wish to configure drift detection for and navigate to Settings > Scheduling. A couple of notable control options are provided here:
- Reconcile – When turned on, Spacelift automatically remediates the drift. When infrastructure drift is identified, Spacelift triggers the “terraform apply” workflow to restore the original state of infrastructure as per the Terraform configuration.
- Schedule – A simple cron notation that determines how frequently Spacelift scans the deployment and compares its state. In the example below, the drift detection happens every 15 minutes.
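For readers unfamiliar with cron notation, an every-15-minutes schedule is conventionally written as `*/15 * * * *`. The sketch below expands the minute field of such an expression to show when runs would start. It is an illustration only: the function name is hypothetical, it handles just `*`, `*/n`, and comma-separated numbers, and it is not how Spacelift parses schedules.

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field into the minutes (0-59) it matches.

    Supports only "*", "*/n", and comma-separated numbers.
    """
    minutes = set()
    for part in field.split(","):
        if part == "*":
            minutes.update(range(60))
        elif part.startswith("*/"):
            step = int(part[2:])
            minutes.update(range(0, 60, step))
        else:
            minutes.add(int(part))
    return sorted(minutes)


schedule = "*/15 * * * *"  # minute, hour, day of month, month, day of week
minute_field = schedule.split()[0]

if __name__ == "__main__":
    print(expand_minute_field(minute_field))  # [0, 15, 30, 45]
```

A shorter interval detects drift sooner but triggers more runs, so the frequency is a trade-off between detection time and resource usage.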
When drifts are detected, they are represented in an intuitive way that makes it easy to interpret their impact.
As shown in the screenshot below, we can quickly see that one of the network components has drifted.
Clicking on the drifted block quickly provides us with the details of the drift.
Read more in the documentation.
Whenever drifts are detected, it is important to understand the factors that caused them. As discussed previously, changes introduced outside the scope of Terraform could be either desired or unwanted. Some examples:
- Changes caused by overlapping scopes of automation tools are usually desired. However, the responsibility for them is not clearly defined.
- Changes are introduced manually to troubleshoot a related issue somewhere else but are then forgotten and never reverted. Such changes are not desired, as they expose the system to various vulnerabilities and potentially have cost implications.
- To address issues in critical services, hotfixes are implemented. Hotfixes are either permanent fixes or temporary workarounds. Thus it is difficult to classify all hotfixes as desired or unwanted, and each one needs further investigation.
As seen from the examples above, understanding drift needs some analysis. Based on this, the course of remediation usually boils down to the following:
- If the changes are desired, then import the configuration under Terraform management scope.
- If the changes are not desired, then reinstate the original state by running “terraform apply”.
- If a resource is not supposed to be managed by Terraform, then disassociate the same from Terraform state and configuration.
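The three options above map onto standard Terraform CLI commands. The following sketch makes that mapping explicit. The `Decision` enum, the `remediation_steps` helper, and the example address are hypothetical names used for illustration; `terraform import`, `terraform plan`, `terraform apply`, and `terraform state rm` are the actual commands. Lines starting with `#` in the output are manual steps, not commands.

```python
import shlex
from enum import Enum
from typing import Optional


class Decision(Enum):
    ADOPT = "adopt"      # drift is desired: bring it under Terraform management
    REVERT = "revert"    # drift is unwanted: restore the configured state
    RELEASE = "release"  # resource should not be managed by Terraform


def remediation_steps(decision: Decision, address: str,
                      resource_id: Optional[str] = None) -> list[str]:
    """Suggest the steps for one drifted resource.

    `address` is the Terraform resource address; `resource_id` is the cloud
    provider's ID, needed only when importing a resource created outside
    Terraform.
    """
    addr = shlex.quote(address)
    if decision is Decision.ADOPT:
        if resource_id is not None:
            steps = [f"terraform import {addr} {shlex.quote(resource_id)}"]
        else:
            steps = [f"# edit the configuration of {address} to match the live settings"]
        # A follow-up plan with no proposed changes confirms the drift is resolved.
        return steps + ["terraform plan"]
    if decision is Decision.REVERT:
        return ["terraform plan", "terraform apply"]
    if decision is Decision.RELEASE:
        return [f"terraform state rm {addr}",
                f"# remove the {address} block from the configuration"]
    raise ValueError(f"unsupported decision: {decision}")


if __name__ == "__main__":
    for decision in Decision:
        print(decision.value, remediation_steps(decision, "aws_security_group.web"))
```

The function only returns the suggested steps rather than executing them, since deciding whether a drift is desired usually needs human review first.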
When drift detection is enabled, Spacelift highlights the drift in the very next run. It depends on how frequently the drift detection runs are configured. When we enable the “Reconcile” option in the drift detection schedule, Spacelift automatically triggers Terraform runs to reinstate the original configuration. This is suitable when the scopes are clearly defined, and resource management policies are firmly in place.
However, if the boundaries are not clearly defined yet, then it is recommended to turn off the “Reconcile” option. This is because there may be a need to either import drifts or disassociate infrastructure from the current Terraform configuration. The drift detection schedule again plays an important role in confirming mitigation actions post-import/disassociation.
Managing infrastructure drift is a tricky subject since drift may originate from any source. It is difficult to fully track who changed what and when. At the same time, the risk posed by such drift can vary from low to critical, and it can affect the security, cost, and reliability of the system.
The Terraform IaC and the state files are the only reliable and predictable sources of information about the managed infrastructure. In such cases, Spacelift provides much-needed drift detection by providing monitoring and an intuitive UI to highlight the drifts (and optionally automate the reconciliation). This makes it easy to know what has changed and provides us with the direction to investigate.