5 Ways to Manage Terraform at Scale – Best Practices

Learning Terraform can be easy. The HCL configuration language is simple enough to start with basic resource creation straight away. 

But—

Once the code that defines your infrastructure grows in complexity and accumulates dependencies, keeping it maintainable can become an uphill battle. That is why working out an efficient deployment workflow is crucial for further development.

In this article you will see five approaches to managing Terraform workflows at scale, what their benefits and downsides are, and what problems they address.

Note: New versions of Terraform are released under the BUSL license, but every version up to and including 1.5.x remains open source. OpenTofu is an open-source fork of Terraform, created from Terraform version 1.5.6, that expands on Terraform’s existing concepts and offerings. It is a viable alternative to HashiCorp’s Terraform: it retains the features and functionality that made Terraform popular among developers while introducing improvements and enhancements. OpenTofu is the future of the Terraform ecosystem, and its main priority is to provide a truly open-source project that supports all your IaC needs.

1) Work Locally

A quick and easy way to start deploying resources using the Infrastructure as Code pattern is to use Terraform locally from the same machine you’re developing the code on. The only requirement is to have access to the target provider and an installed Terraform binary.

This approach makes day-to-day state operations (such as import, move, and remove) and reading outputs much easier than in setups where you have no direct access to the underlying provider.
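Some of these state operations also have declarative counterparts in recent releases: `import` blocks (Terraform 1.5 and later) and `moved` blocks (Terraform 1.1 and later). The following is a minimal sketch, assuming the AWS provider; the bucket and resource names are hypothetical and only illustrate the shape of the blocks.

```hcl
# Adopt an existing bucket into state on the next apply
# (the declarative counterpart of `terraform import`).
import {
  to = aws_s3_bucket.logs
  id = "example-logs-bucket" # hypothetical bucket name
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}

# Record a rename so the existing object is not destroyed and recreated
# (the declarative counterpart of `terraform state mv`).
moved {
  from = aws_s3_bucket.assets_old
  to   = aws_s3_bucket.assets
}

resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-bucket" # hypothetical bucket name
}
```

Because these blocks live in the codebase, they go through the same review as any other change. The CLI commands remain useful for one-off fixes.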

Even though this is the easiest way to work with Terraform, it is not a best practice: as soon as you scale and have multiple DevOps engineers working on the same project, it becomes very complicated.

You can still use all of Terraform’s features, such as remote state and state locking, to work alongside your teammates. If you collaborate using a version control system and do not have continuous integration in place, but want to keep your code at a decent level of quality, there are tools to help you.

One of these is pre-commit-terraform, which runs a set of selected tools to lint your code before you even commit it. Even if you have continuous integration available, it’s a handy way to avoid wasting time waiting for results from pull request status checks.
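The remote state and state locking mentioned above could be configured roughly as follows. This is a sketch that assumes AWS as the backend (S3 for storage, DynamoDB for the lock); the bucket, key, region, and table names are hypothetical placeholders.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical, must already exist
    key            = "network/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "example-terraform-locks" # table with a string partition key named LockID
    encrypt        = true                      # encrypt the state object at rest
  }
}
```

With this in place, every teammate who runs a plan or apply reads the same state, and a second concurrent run fails to acquire the lock instead of overwriting the first.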

What are the drawbacks of this approach? Each action is manual and must be performed directly by the developer, which leads to several issues. 

In a situation where multiple people are working on the same codebase, you could often encounter a scenario such as this one:

Person A applies their changes to the environment and follows up with a pull request. At the same time, Person B would like to deploy their changes but is blocked because their copy of the codebase does not yet include Person A’s changes. Applying from that outdated copy would revert or destroy the resources created in the previous deployment. In larger teams, this often results in a lot of time wasted on waiting and rebasing before changes can be applied.

If you write unit tests for your Terraform modules (and you should), you know they need to be executed manually from your computer. Running them every time you change something in your configuration can be tedious and time-consuming. If you do not yet write tests for your modules, you’d be well advised to start looking at Spacelift module tests or Terratest.
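To give a feel for what a module test can look like, here is a small sketch. It does not use the tools named above but the native test framework (`terraform test`, available from Terraform 1.6, and `tofu test` in OpenTofu), simply because it is written in HCL. The module, file names, and bucket naming rule are hypothetical.

```hcl
# --- main.tf (hypothetical module under test) ---
variable "environment" {
  type = string
}

resource "aws_s3_bucket" "logs" {
  bucket = "logs-${var.environment}"
}

# --- tests/naming.tftest.hcl (hypothetical test file, kept separately) ---
variables {
  environment = "dev"
}

run "bucket_name_includes_environment" {
  command = plan # evaluate the plan only; nothing is created

  assert {
    condition     = aws_s3_bucket.logs.bucket == "logs-dev"
    error_message = "Bucket name should be logs-<environment>."
  }
}
```

Even a plan-only test like this still needs a configured provider, so running it locally has the same access requirements as any other local run.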

Privileged access to an underlying vendor is required, which can easily result in a compromised environment. It’s much harder to keep restricted access within this model, as Terraform must be allowed to create, change, and destroy resources. The same rule applies to a scenario where access to private (internal) resources is required, e.g., every developer has to set up a VPN connection to interact with them.

Moreover, every person working with the codebase needs access to the Terraform state, which creates a security concern due to how Terraform state works. Even if you’re pulling sensitive data from an external system such as Vault, it will be stored within the state file in plaintext once used in a resource. A threat actor could simply run `terraform state pull` to access all the secrets!
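A short sketch of why this matters, assuming the `hashicorp/random` provider as a stand-in for any secret-producing resource (the resource and output names are hypothetical):

```hcl
terraform {
  required_providers {
    random = {
      source = "hashicorp/random"
    }
  }
}

resource "random_password" "db" {
  length  = 24
  special = true
}

output "db_password" {
  value     = random_password.db.result
  sensitive = true # redacts the value in plan/apply output only
}
```

The `sensitive` flag keeps the password out of terminal output and logs, but the generated value is still written to the state file, so anyone who can read the state can read the secret.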

Keeping all these issues in mind, let’s take a look at a more automated approach.

2) Implement Homegrown Automation

Implementing homegrown automation means incorporating the Terraform workflow into your continuous integration / continuous deployment process. A specific example would be tied directly to the CI/CD tool you are using, so let’s keep the discussion as generic as possible.

As the applying actor is the CI/CD system, the hard requirement of privileged access for developers is gone. You can grant this type of access to the execution layer while developers keep read-only permissions. The same is true for private (internal) resources, provided your code is executed on an agent that resides within the target infrastructure and has the proper permissions. In a model where everything has to go through the CI/CD pipeline, visibility of changes is improved, as the system logs produced during execution are available in real time.
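One way to express this split in configuration is to have the provider assume a privileged role that only the pipeline is allowed to use. The sketch below assumes AWS; the variable name, region, and session name are hypothetical.

```hcl
variable "deploy_role_arn" {
  type        = string
  description = "ARN of the privileged role that only the CI/CD agent may assume (hypothetical)."
}

provider "aws" {
  region = "eu-west-1" # assumed region

  # The pipeline's own identity needs permission to assume this role;
  # developers' identities do not get it and stay read-only.
  assume_role {
    role_arn     = var.deploy_role_arn
    session_name = "terraform-ci"
  }
}
```

The trust policy on the role itself, which is not shown here, is what actually restricts who can assume it.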

Linting, compliance checks, and automated unit tests can be moved to pull request status checks, and it’s up to you to configure these steps of the pipeline.

But—

If your CI/CD is executing jobs concurrently, you will end up with race conditions causing pull requests to fail non-deterministically. 

It is not difficult to imagine a situation where two operators open a pull request within several seconds of one another. The pipeline gets triggered immediately for both, and one pull request will be marked as successful, while the other will be marked as failed. This situation will definitely happen if you are using state locking (it is a best practice and you really should be using it).

The explanation is simple: the first pull request has locked the state and therefore the second one could not, so it failed. This issue can be resolved by configuring queued runs, where only one pipeline can run at a time. However, this is not so simple with some CI/CD products.

The biggest concern with this approach is the effort you have to put in to build a complete pipeline. Consider requirements such as:

  • planning on pull requests
  • linting
  • unit tests
  • compliance checks
  • applying once merged
  • periodic drift detection
  • registry for all your versioned modules
  • module sharing across different parts of the organization
  • shared variable contexts/reusable values

It might be tempting to implement solutions for those requirements by yourself. However, taking into account the time and effort required for doing it, you might want to take a look at alternatives, such as open-source tools.

3) Use Open Source

One of the most popular open-source tools to use with Terraform is Atlantis, which describes itself as “Terraform pull request automation”.

Atlantis provides a way for teams to collaborate on Terraform code changes by creating a pull request workflow. With Atlantis, developers can create, review, and merge Terraform code changes using familiar Git-based workflows. Atlantis also provides features such as automated plan and apply workflows, automatic branch cleanup, and integration with popular collaboration tools like Slack and GitHub.

It doesn’t offer a dedicated user interface, so you will need to use the one provided by your VCS provider. If you want more advanced features like Policy as Code or Drift Detection, you will need to build the integrations inside your pipeline, as Atlantis doesn’t come with these out of the box.

You can read more about it here: Alternative to Atlantis.

4) Use Terraform Cloud

Terraform Cloud is an application provided by the company behind Terraform itself—HashiCorp. It is available for free for up to 5 users. Two plans are available for more users: “Team & Governance” and “Business”. As the product is created by Terraform authors themselves, any new feature that makes it to Terraform will be quickly available within Terraform Cloud as well. Neat!

See the comparison: Atlantis vs. Terraform Cloud.

Advantages of Terraform Cloud

Advanced security compliance mechanisms available through Sentinel allow you to enforce organization standards even before the code actually ships and creates resources with the chosen provider context.

Another important feature is a module registry that acts as an artifact repository for all your modules. It can be compared to JFrog Artifactory, Sonatype Nexus, or other similar software. Having every module in a separate repository with its own versioning allows you to control code promotion in a granular fashion or quickly roll back to a previous version.

Having a centralized place to store all the pieces also enables you to quickly browse through available modules, their documentation, inputs, outputs, and used resources. It is a great single pane of glass in terms of module management. However, it can only be used within the same Terraform Cloud organization and cannot be shared. This might be troublesome for operations teams that base their separation on the organization level.

There’s also the remote execution backend, a feature exclusive to Terraform Cloud. It allows you to work with Terraform on your local computer as you usually would, upload your local codebase, and run a plan against it. It is a very efficient way of quickly verifying that what you are working on actually works.

Disadvantages of Terraform Cloud

Along with so many benefits, there are some drawbacks as well. Any linting or module unit tests need to be implemented inside your own continuous integration system. This requires an additional effort the development team has to make to sustain their code quality.

Integrations are limited to what Run Tasks support at any given time, meaning that you cannot really own your workflow. You have to make do with what Terraform Cloud offers.

Another disadvantage is that you have limited control over the steps that happen before and after an operation (init/plan/apply). Run tasks can be added before or after a plan and before an apply, but not before an init or after an apply. This lack of flexibility means you will need to build pipelines in GitHub Actions, or whatever other CI/CD tool you prefer, to run even simple checks before your code. The only other option is remote Terraform command execution using the CLI, which is not that helpful overall.

The biggest drawback is the fact that Terraform Cloud doesn’t support any other products apart from Terraform. Of course, this is not its purpose, but as many organizations are using Kubernetes and Ansible in conjunction with Terraform, this would mean that you need to adopt other tools like ArgoCD and Ansible Tower to manage your workflows end to end.

5) Use Spacelift

And here we are. You can compare Spacelift directly to Terraform Cloud. They share a lot of functionality with each other in terms of the actual deployment execution, but also in the area of compliance.

While Terraform Cloud mostly uses Sentinel (HashiCorp’s own compliance-as-code framework), Spacelift leverages Open Policy Agent (OPA). Companies that already use OPA in their workflow (e.g., for Kubernetes) benefit from working with policies in the same way everywhere. As there is no need to learn another syntax, operators can focus their efforts on providing compliance, instead of learning yet another tool.

With Spacelift’s custom inputs, you can easily integrate any tool inside your workflow, but what really takes this to the next level is the fact that you can even write policies for it.

Do you want tfscan as part of your workflow? Install it and run it in a `before_init` hook, then save the output to a file named `<key>.custom.spacelift.json` (where `<key>` can be anything you like). Access the data in a plan policy under `input.third_party_metadata.custom.<key>` and you can write whatever policy you want. Don’t know what data is getting exposed?

Use Spacelift’s policy workbench and play around with the data until you easily understand what policies you can write based on the output of the integrated tool.

Spacelift also allows you to automate your workflow even further. With built-in continuous integration for modules, you can quickly move your linters and unit tests into one place dedicated to Terraform. This reduces the effort required to implement an integration workflow, because you reuse a solution you already have.

Stack Dependencies help you deploy everything you need in one go, and building these dependencies is just one click away. Because Stack Dependencies form a directed acyclic graph (DAG), one stack can depend on multiple stacks and multiple stacks can depend on it, but loops cannot exist. This unlocks the possibility of creating a sophisticated workflow without having to build complex pipelines.
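Besides the UI, dependencies can also be managed as code. The sketch below assumes the Spacelift Terraform provider and its `spacelift_stack_dependency` resource as I understand them; check the resource and argument names against the provider documentation. The stack IDs are passed in through hypothetical variables.

```hcl
terraform {
  required_providers {
    spacelift = {
      source = "spacelift-io/spacelift"
    }
  }
}

# IDs of three existing stacks (hypothetical inputs).
variable "network_stack_id" { type = string }
variable "cluster_stack_id" { type = string }
variable "app_stack_id" { type = string }

# network -> cluster -> app: each stack runs after the one it depends on.
resource "spacelift_stack_dependency" "cluster_on_network" {
  stack_id            = var.cluster_stack_id
  depends_on_stack_id = var.network_stack_id
}

resource "spacelift_stack_dependency" "app_on_cluster" {
  stack_id            = var.app_stack_id
  depends_on_stack_id = var.cluster_stack_id
}
```

Adding a dependency from the network stack back to the app stack would close a loop, which the DAG constraint described above rules out.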

Sometimes you need to work with a Terraform state directly to import something, for example. In scenarios where you are leveraging Terraform Cloud, it is often impossible to do this without accessing the provider yourself. In contrast, Spacelift gives you the ability to run literally any command within the context of a particular codebase via tasks. This way, you do not need to grant any additional permissions to import, move, or remove resources directly.

By utilizing contexts, you can simply share anything you like across multiple stacks—whether it is a set of environment variables or a file. This is something that becomes much more valuable once your codebase complexity increases.

As mentioned earlier, drift detection becomes much more important once your infrastructure grows. With Spacelift, it is possible to schedule periodic drift detection on any stack. Going even further, you can enable automated remediation if any changes are found compared to what you have in your codebase.
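A scheduled drift check with remediation might be declared like this. The sketch again assumes the Spacelift Terraform provider and its `spacelift_drift_detection` resource as I understand them; the stack ID variable and the schedule are hypothetical.

```hcl
terraform {
  required_providers {
    spacelift = {
      source = "spacelift-io/spacelift"
    }
  }
}

variable "stack_id" {
  type        = string
  description = "ID of the stack to check for drift (hypothetical input)."
}

resource "spacelift_drift_detection" "nightly" {
  stack_id  = var.stack_id
  schedule  = ["0 3 * * *"] # cron expression: every day at 03:00
  reconcile = true          # remediate detected drift automatically
}
```

Setting `reconcile` to `false` would keep the checks as reports only, which may be a safer starting point for production stacks.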

Best practices for managing Terraform at scale

Regardless of how you decide to manage your Terraform code base, some best practices apply:

  • Use modules – Modules in Terraform allow you to package a collection of resources as a single logical unit. By doing so, you can create reusable components, making your codebase DRY and consistent. Modularizing code also improves code clarity and reduces duplication. 
  • Leverage Dynamic Blocks – Dynamic Blocks in Terraform let you generate repeated nested blocks from input variables or other logic. This can help reduce repetition in your code and adapt configurations to different scenarios or environments without significant structural changes.
  • Use loops and conditionals – count and for_each allow you to iterate over lists or maps to create resources dynamically. This is particularly useful for scenarios where resource quantities or configurations differ based on input variables. Additionally, conditionals using the ternary (? :) operator can enable or disable configurations based on certain criteria. You can also use for expressions when building values, optionally with an if clause to filter elements, which lets you derive almost any expression you need from your input.
  • Take advantage of variable validations – Starting from Terraform 0.13, you can include validation rules for input variables, ensuring that the values provided meet certain conditions before Terraform runs your configuration. This can prevent common configuration errors and ensure that provided input uses the expected formats or constraints.
  • Manage your state remotely with locking enabled – Using remote backends like AWS S3 paired with DynamoDB for state locking enables safer collaboration among team members. Remote state management ensures state consistency, while state locking prevents concurrent modifications, reducing the risk of state corruption or unintended infrastructure changes.
  • Scan your code for security vulnerabilities – Security is mandatory, especially when defining infrastructure. Tools like Checkov, tfsec, or Terrascan can scan your Terraform code for potential security misconfigurations or non-compliant definitions. Regularly scanning and addressing these vulnerabilities reduces the attack surface of your deployed infrastructure.
  • Implement Policies – Open Policy Agent (OPA) offers a flexible way to define and enforce policies across your Terraform configurations. Using a declarative language called Rego, you can create policies that restrict certain resource configurations, enforce naming conventions, or ensure compliance. Integrating OPA checks into your CI/CD pipeline helps maintain consistent adherence to these policies as your Terraform code evolves.
  • Implement Linting – Linting is the process of analyzing code for potential errors and inconsistencies and checking that it follows best practices. In the context of Terraform, linting can help catch syntax errors, deprecated code usage, or misconfigurations before they cause issues during an apply.
  • Test your code – Testing is a fundamental part of any software development project. Tools such as Terratest or Kitchen-Terraform can help you write integration tests for your Terraform configurations, ensuring your code works properly before it is deployed to the production environment.
  • Write thorough documentation – Documentation is vital for understanding the purpose and usage of your Terraform configurations, especially when collaborating with others. Terraform-docs is a handy tool that can automatically generate documentation for your Terraform modules, providing an overview of inputs, outputs, providers, and more. Integrating this into your CI/CD pipeline ensures that your documentation stays up-to-date with your code changes.
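Several of the language features from the list above (variable validation, dynamic blocks, for_each, the ternary operator, and for expressions with an if clause) can be seen together in one small sketch. It assumes the AWS provider; all names, ports, and values are hypothetical.

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment."

  # Variable validation: reject unexpected values before anything is planned.
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

variable "vpc_id" {
  type = string
}

variable "ingress_rules" {
  type = map(object({
    port        = number
    cidr_blocks = list(string)
  }))
  default = {
    https = { port = 443, cidr_blocks = ["0.0.0.0/0"] }
    admin = { port = 8443, cidr_blocks = ["10.0.0.0/8"] }
  }
}

resource "aws_security_group" "web" {
  name   = "web-${var.environment}"
  vpc_id = var.vpc_id

  # Dynamic block: one ingress block per entry in the map.
  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      description = ingress.key
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = "tcp"
      cidr_blocks = ingress.value.cidr_blocks
    }
  }
}

# for_each on a resource: one bucket per name in the set.
resource "aws_s3_bucket" "logs" {
  for_each = toset(["app", "audit"])
  bucket   = "example-${var.environment}-${each.key}-logs"
}

locals {
  # Ternary conditional: size depends on the environment.
  instance_count = var.environment == "prod" ? 3 : 1

  # for expression with an if clause: ports open to the whole internet.
  public_ports = [
    for name, rule in var.ingress_rules : rule.port
    if contains(rule.cidr_blocks, "0.0.0.0/0")
  ]
}
```

With the default rules shown here, `local.public_ports` contains only port 443, since the admin rule is limited to an internal range.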

Key Points

There are many ways of working with Terraform. Each way is different in terms of complexity, and has a different set of features. It is important to keep in mind that choosing one way or another should be based on business and technical requirements. Most times, there is no point in implementing an in-house solution as the cost and effort of building and maintaining it may often exceed its potential benefits. It is much easier and quicker to leverage platforms such as Spacelift to provide these features for you instead.

