5 Ways to Manage Terraform at Scale

Learning Terraform can be easy. The HCL configuration language is simple enough that you can start creating basic resources straight away.

But—

Once the code that defines your infrastructure grows in size, complexity, and interdependency, maintaining it can become an uphill battle. That is why working out an efficient deployment workflow is crucial for further development.

In this article you will see five approaches to managing Terraform at scale, what their benefits and downsides are, and what problems they address.

1) Work Locally

A quick and easy way to start deploying resources using the Infrastructure as Code pattern is to use Terraform locally from the same machine you’re developing the code on. The only requirement is to have access to the target provider and an installed Terraform binary.
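
For context, a local workflow can be as small as a single file. Below is a minimal sketch, assuming AWS as the provider; the resource and bucket names are hypothetical:

```hcl
# main.tf - a minimal local configuration (provider and names are illustrative)
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "eu-west-1"
}

# A single resource is enough to get started
resource "aws_s3_bucket" "example" {
  bucket = "my-example-bucket-20240101"
}
```

From there, `terraform init`, `terraform plan`, and `terraform apply` run directly from your machine against the target provider.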

This approach makes it much easier to run day-to-day operational commands (such as state import, move, or remove) or to read outputs than it is in setups where you have no direct access to the underlying provider.

You can also use all of Terraform's features, such as remote state and state locking, to collaborate with your teammates. If you collaborate using a version control system and do not have continuous integration in place, but still want to keep your code at a decent level of quality, there are tools to help you.

One of these is pre-commit-terraform, which runs a selected set of tools to lint the code you are writing before you even commit it. Even if you do have continuous integration available, it is a handy way to avoid wasting time waiting for pull request status checks.

What are the drawbacks of this approach? Each action is manual and must be performed directly by the developer, which leads to several issues. 

In a situation where multiple people are working on the same codebase, you could often encounter a scenario such as this one:

Person A applies their changes to the environment and follows up with a pull request. At the same time, Person B would like to deploy their changes but is blocked, as their codebase is not up to date with Person A's changes. Applying changes in this state could damage the resources created in the previous deployment. In larger teams, this often results in a lot of time wasted waiting and rebasing before changes can be applied.

If you happen to write unit tests for Terraform modules (and you should), you know they have to be executed manually from your computer. Running them every time you change something in your configuration can be tedious and time-consuming. If you do not yet write tests for your modules, you would be well advised to start looking at Spacelift module tests or Terratest.
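
If you prefer to stay in HCL rather than Go-based Terratest, recent Terraform versions also ship a native test framework. The sketch below assumes a module that exposes a `bucket_name` variable and creates an `aws_s3_bucket.this` resource; the names are hypothetical:

```hcl
# tests/bucket.tftest.hcl - a minimal module test (Terraform 1.6+ native test framework)
run "bucket_name_is_propagated" {
  # Use a plan-only run so no real resources are created
  command = plan

  variables {
    bucket_name = "unit-test-bucket"
  }

  assert {
    condition     = aws_s3_bucket.this.bucket == "unit-test-bucket"
    error_message = "The bucket name was not taken from the input variable"
  }
}
```

Running `terraform test` executes it, but in a local workflow this still has to be triggered by hand every time something changes.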

Another issue is that every developer needs privileged access to the underlying provider, which can easily result in a compromised environment. It is much harder to keep access restricted in this model, as Terraform must be allowed to create, change, and destroy resources. The same applies where access to private (internal) resources is required, e.g., every developer has to set up a VPN connection to interact with them.

Moreover, every person working with the codebase needs access to the Terraform state, which creates a security concern due to how Terraform state works. Even if you’re pulling sensitive data from an external system such as Vault, it will be stored within the state file in plaintext once used in a resource. A threat actor could simply run `terraform state pull` to access all the secrets!
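
To illustrate, here is a hedged sketch: a credential read from Vault and passed into a resource argument ends up in the state file just like any other attribute (the paths and names below are hypothetical):

```hcl
# The secret is fetched from Vault at plan/apply time...
data "vault_generic_secret" "db" {
  path = "secret/example-app/database"
}

# ...and once used in a resource argument, it is persisted in the state file in plaintext
resource "aws_db_instance" "example" {
  identifier        = "example-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app"
  password          = data.vault_generic_secret.db.data["password"]
}
```

Anyone who can run `terraform state pull` against that backend can read the password.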

Keeping all these issues in mind, let’s take a look at a more automated approach.

2) Implement Homegrown Automation

Implementing homegrown automation means incorporating the Terraform workflow into your continuous integration / continuous deployment process. A specific example would need to be tied to the particular CI/CD tool you are using, so let's keep the discussion as generic as possible.

As the applying actor is the CI/CD system, the hard requirement of privileged access for developers goes away. You can grant this type of access to the execution layer while developers keep read-only permissions. The same holds for private (internal) resources, provided your code runs on an agent that resides within the target infrastructure and has the proper permissions. In a model where everything goes through the CI/CD pipeline, visibility of changes also improves, as the logs produced during execution are available in real time.

Linting, compliance checks, and automated unit tests can be moved to pull request status checks, and it’s up to you to configure these steps of the pipeline.

But—

If your CI/CD is executing jobs concurrently, you will end up with race conditions causing pull requests to fail non-deterministically. 

It is not difficult to imagine a situation where two operators open pull requests within a few seconds of one another. The pipeline gets triggered immediately for both, and one pull request will be marked as successful while the other will be marked as failed. This will reliably happen if you are using state locking (which is a best practice, and you really should be using it).

The explanation is simple: the first pull request locked the state, so the second one could not acquire the lock and failed. This issue can be resolved by configuring queued runs, so that only one pipeline runs at a time. However, this is not so simple with some CI/CD products.
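
For reference, state locking itself is configured on the backend rather than in the pipeline. A common sketch, assuming an S3 backend with a DynamoDB lock table (the bucket and table names are hypothetical):

```hcl
# backend.tf - remote state with locking; the S3 bucket and DynamoDB table must already exist
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "network/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "example-terraform-locks" # provides the state lock
    encrypt        = true
  }
}
```

Whichever pipeline acquires the lock first wins; queuing the remaining runs is what you have to solve in the CI/CD layer.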

The biggest issue and concern with this approach is the effort you have to put in to build a complete pipeline. If you consider requirements such as:

  • planning on pull requests
  • linting
  • unit tests
  • compliance checks
  • applying once merged
  • periodic drift detection
  • registry for all your versioned modules
  • module sharing across different parts of the organization
  • shared variable contexts/reusable values

It might be tempting to implement solutions for all of these requirements yourself. However, taking into account the time and effort required, you might want to take a look at alternatives, such as open-source tools.

3) Use Open Source

One of the most popular open-source tools to use with Terraform is Atlantis, which describes itself as “Terraform pull request automation”. You can read more about it here: Alternative to Atlantis.

4) Use Terraform Cloud

Terraform Cloud is an application provided by the company behind Terraform itself—HashiCorp. It is available for free for up to 5 users. Two plans are available for more users: “Team & Governance” and “Business”. As the product is created by Terraform authors themselves, any new feature that makes it to Terraform will be quickly available within Terraform Cloud as well. Neat!

Advantages of Terraform Cloud

Advanced policy and compliance mechanisms, available through Sentinel, allow you to enforce organization standards before the code actually ships and creates resources in the chosen provider context.

Another important feature is the module registry, which acts as an artifact repository for all your modules. It can be compared to JFrog Artifactory, Sonatype Nexus, or other similar software. Having every module in a separate repository, with its own versioning, allows you to control code promotion in a granular fashion or quickly roll back to a previous version.
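
Consuming such a module then comes down to a registry source address and a version constraint. A sketch, with a hypothetical organization and module name:

```hcl
# Pulling a versioned module from the Terraform Cloud private registry
module "vpc" {
  source  = "app.terraform.io/acme-corp/vpc/aws"
  version = "~> 2.1" # raise the constraint to promote, lower it to roll back

  cidr_block = "10.0.0.0/16"
}
```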

Having a centralized place to store all the pieces also enables you to quickly browse through the available modules, their documentation, inputs, outputs, and the resources they use. It is a great single pane of glass for module management. However, the registry can only be used within the same Terraform Cloud organization and cannot be shared, which might be troublesome for operations teams that base their separation on the organization level.

There's also the remote execution backend, which is a feature exclusive to Terraform Cloud. It allows you to work with Terraform on your local computer as you usually would, upload your local codebase, and run a plan against it. It is a very efficient way of quickly verifying that what you are working on actually works.

Disadvantages of Terraform Cloud

Along with the many benefits, there are some drawbacks as well. Any linting or module unit tests need to be implemented inside your own continuous integration system, which requires additional effort from the development team to sustain code quality.

Another disadvantage is that you are not able to share common values across your workspaces. Imagine a situation where you want to share a common set of variables, such as the cloud region or environment name. The only way to achieve something similar to a shared context is to have a dedicated workspace that outputs this information as an input for all the remaining workspaces, although this is more of a workaround than an actual solution.
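
In practice, that workaround usually means reading the outputs of the dedicated workspace via a remote state data source. A sketch, assuming a hypothetical `shared-config` workspace in an `acme-corp` organization:

```hcl
# Consume outputs exposed by a dedicated "shared-config" workspace
data "terraform_remote_state" "shared" {
  backend = "remote"

  config = {
    organization = "acme-corp"
    workspaces = {
      name = "shared-config"
    }
  }
}

locals {
  region      = data.terraform_remote_state.shared.outputs.region
  environment = data.terraform_remote_state.shared.outputs.environment
}
```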

Compliance as code is available via Sentinel, a framework created by HashiCorp. Having a built-in solution for developing policies is nice, but operators often already use Open Policy Agent (OPA) to define policies across other areas of the infrastructure. This frequently results in doubled effort, as you need to create similar policies in both frameworks.

The biggest shortcoming of Terraform Cloud is, in my opinion, the lack of periodic drift detection runs. Once something is applied and the codebase stops receiving updates, you will never be notified if the real infrastructure has drifted from the state described in the repository. Scheduling a daily plan to verify that everything is up to date would be highly beneficial for keeping the infrastructure as it is described programmatically.

5) Use Spacelift

And here we are. You can compare Spacelift directly to Terraform Cloud. They share a lot of functionality in terms of actual deployment execution, but also in the area of compliance.

While Terraform Cloud uses Sentinel (HashiCorp's approach to compliance as code), Spacelift leverages Open Policy Agent. For companies that already incorporate OPA in their workflow (e.g., for Kubernetes), it is beneficial to keep a consistent way of working with policies. As there is no need to learn another syntax, operators can focus their efforts on providing compliance instead of learning yet another tool.

Spacelift also allows you to automate your workflow even further. With built-in continuous integration for modules, you can quickly move your linters and unit tests into one place dedicated to Terraform. This reduces the effort required to implement an integration workflow, as you get to rely on a solution that is already in place.

Sometimes you need to work with a Terraform state directly to import something, for example. In scenarios where you are leveraging Terraform Cloud, it is often impossible to do this without accessing the provider yourself. In contrast, Spacelift gives you the ability to run literally any command within the context of a particular codebase via tasks. This way, you do not need to grant any additional permissions to import, move, or remove resources directly.

By utilizing contexts, you can simply share anything you like across multiple stacks—whether it is a set of environment variables or a file. This is something that becomes much more valuable once your codebase complexity increases.
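
As an illustration, contexts can themselves be managed as code through the Spacelift Terraform provider. The sketch below assumes that provider is configured and uses hypothetical names and stack IDs:

```hcl
# A context holding values shared by multiple stacks
resource "spacelift_context" "common" {
  name        = "common-settings"
  description = "Values shared across stacks"
}

# An environment variable attached to the context
resource "spacelift_environment_variable" "region" {
  context_id = spacelift_context.common.id
  name       = "TF_VAR_region"
  value      = "eu-west-1"
  write_only = false
}

# Attach the context to a stack so it receives those values
resource "spacelift_context_attachment" "network" {
  context_id = spacelift_context.common.id
  stack_id   = "network-production"
}
```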

As mentioned earlier, drift detection becomes much more important once your infrastructure grows. With Spacelift, it is possible to schedule periodic drift detection on any stack. Going even further, you can enable automated remediation if any changes are found compared to what you have in your codebase.

Key Points

There are many ways of working with Terraform. Each differs in complexity and comes with a different set of features. Keep in mind that the choice between them should be based on business and technical requirements. Most of the time, there is no point in implementing an in-house solution, as the cost and effort of building and maintaining it can easily exceed its potential benefits. It is much easier and quicker to let a platform such as Spacelift provide these features for you instead.
