
13 Biggest Terraform Challenges & Pitfalls (+ Fixes)


Terraform is a powerful infrastructure-as-code (IaC) tool, but many teams hit the same pain points as they scale: remote state management, secrets ending up in state, configuration drift, module sprawl, slow plans and applies, and safe promotion across environments.

In this article, we’ll walk through 13 of the biggest Terraform challenges, with practical tips to help you build faster, safer workflows.

These challenges include:

  1. State management at scale
  2. Sensitive data ending up in state and plan artifacts
  3. Preventing and detecting configuration drift
  4. Taming the dependency graph and resource ordering
  5. Provider versioning and upgrade surprises
  6. Dealing with cloud API rate limits and eventual consistency
  7. Managing multiple environments without chaos
  8. Managing Terraform modules at scale
  9. Refactoring without accidental destroy/recreate
  10. Importing existing (brownfield) infrastructure
  11. Performance bottlenecks in large plans and applies
  12. Making changes safe: review, testing, and policy guardrails
  13. Licensing and governance uncertainty

1. State management at scale

Terraform state management gets tricky the moment your team and CI/CD start running Terraform in parallel. The terraform.tfstate file is Terraform’s “source of truth” for what it thinks exists. If two runs can write state at the same time (or the state is stored somewhere unreliable), you can end up with conflicting updates and painful recovery work.

If it’s stored locally, checked into Git by accident, or sitting in a remote bucket without locking, you’re one bad day away from conflicting applies, state corruption, or awkward “Who rotated the database password, and why is it in plaintext in terraform.tfstate?” conversations.

Terraform can lock state during operations that write it, but only if your chosen remote backend supports locking. You also shouldn’t reach for -lock=false except as a last resort.

A classic example: Two engineers both run terraform apply within a minute of each other. They each start from the same state snapshot, both make “correct” decisions locally, and then whichever apply finishes last writes the final state — potentially masking what the first run actually changed. Locking is what turns this from a footgun into a boring, predictable workflow.

Solution

The baseline best practice is remote state storage with state locking and strict access controls because state can contain sensitive values if secrets ever flow through resources, variables, or outputs.

If you’re on AWS, the S3 backend supports native locking with use_lockfile = true, and HashiCorp recommends enabling S3 bucket versioning so you can recover from accidental deletions or bad writes.

If you still rely on dynamodb_table for S3 locking, treat it as deprecated and plan a migration path, since DynamoDB-based locking is planned to be removed in a future minor version.

terraform {
  backend "s3" {
    bucket       = "acme-terraform-state"
    key          = "prod/network/terraform.tfstate"
    region       = "eu-central-1"
    use_lockfile = true # native locking via a lock object in the bucket
  }
}

And if you’d rather not manage state plumbing at all, Spacelift can manage Terraform/OpenTofu state as an optional backend during stack creation, handling state access in a way that’s tied to legitimate runs/tasks (instead of random ad-hoc credentials floating around).

2. Sensitive data ending up in state and plan artifacts

Terraform is good at not splashing secrets all over your terminal, but that can create a false sense of safety. Even when the CLI shows (sensitive value), the underlying state and plan data can still contain the real value, because Terraform needs a complete record of resource attributes to manage drift and future changes.

State and plan files may include sensitive values like initial database passwords or API tokens — and local state is stored in plaintext by default.

This becomes a real problem in CI/CD: It’s common to save terraform plan -out=tfplan and upload it as an artifact for a later apply job. That plan file can contain enough information to leak secrets if it’s accessible to the wrong people (or just ends up in the wrong place), turning “preview” artifacts into secret blobs you now have to secure like production credentials.

Here’s a simple example of how secrets sneak in. The plan output looks safe, but the value can still land in state (and in a saved plan file) unless you’re using features specifically designed to avoid persistence:

variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "app" {
  engine    = "postgres"
  username  = "app"
  password  = var.db_password
}

Solution

First, accept the model: Marking something as sensitive mostly affects display, not storage. The practical fix is to minimize the number of secrets that ever pass through Terraform, and to use newer Terraform and provider features when you do need to pass them.

One modern option is write-only arguments, which let you pass a secret to a resource without Terraform persisting it in state or plan files — exactly what you want for passwords, tokens, and one-time bootstrap values. Write-only arguments require Terraform 1.11 or newer, plus provider support, so treat them as a preferred option when your provider implements them. For short-lived values, Terraform 1.10 and newer also support ephemeral values, which are designed to keep secrets out of state and plan artifacts entirely in the cases they apply.
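As a minimal sketch, assuming a recent AWS provider release that implements write-only arguments (the exact argument names, like password_wo here, vary by provider and resource):

variable "db_password" {
  type      = string
  ephemeral = true # Terraform 1.10+: never persisted to state or plan files
}

resource "aws_db_instance" "app" {
  engine              = "postgres"
  username            = "app"
  password_wo         = var.db_password # write-only: sent to the API, not stored
  password_wo_version = 1               # bump this to send an updated password
  # (other required arguments omitted for brevity)
}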

For everything else, treat state and plan artifacts as sensitive assets: Store state remotely with encryption and tight IAM, and be cautious about exporting or shipping plans between pipeline stages. State often contains secrets in plaintext, so lock down access accordingly.

Also, remember that a saved plan file captures the inputs used during planning, so you cannot “fix it later” by passing different -var or -var-file values at apply time. If you ship plan artifacts between stages, secure them and treat them as the exact decision you are applying.

3. Preventing and detecting configuration drift

Terraform works best when it’s the system of record, but real life loves exceptions. Someone “just hotfixes” a security group in the cloud console, an ops script tweaks a setting during an incident, or a managed service quietly adjusts a field behind the scenes.

Now your configuration, state, and actual infrastructure diverge, and the next time you run Terraform, it tries to reconcile that gap. Sometimes that’s exactly what you want.

Other times it’s a nasty surprise, like Terraform proposing to undo a deliberate emergency fix or replace a resource because a changed attribute forces recreation.

Solution

The most practical habit is making drift visible before it becomes a scary apply. Terraform’s normal plan/apply flow refreshes data during the run, but if you want a “tell me what changed out of band” check without changing infrastructure, refresh-only is designed for that.

Use terraform plan -refresh-only to see what Terraform would update in state to match reality. Only use terraform apply -refresh-only when you intentionally want to write those refreshed values back to state and outputs.

When drift is expected because another system legitimately manages part of the resource (autoscalers, organization-wide tagging tools, provider-controlled fields), explicitly model that shared ownership using lifecycle.ignore_changes so Terraform doesn’t “fight” changes that are meant to happen elsewhere.
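For example, if an autoscaler owns a group's desired capacity, a sketch like this stops Terraform from reverting it on every apply (resource and attribute names are illustrative):

resource "aws_autoscaling_group" "app" {
  name     = "app"
  min_size = 1
  max_size = 10
  # launch template, subnets, etc. omitted for brevity

  lifecycle {
    # The autoscaler legitimately manages this attribute out of band
    ignore_changes = [desired_capacity]
  }
}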

If you want this to be continuous instead of “whenever someone remembers,” Spacelift can run scheduled drift detection by executing proposed runs against your stable stack state and reporting differences, with an option to automatically reconcile drift depending on how you configure it.

4. Taming the dependency graph and resource ordering

Terraform builds a dependency graph to figure out resource ordering and run as much as possible in parallel, based mostly on the references it can “see” in your configuration.

Trouble starts when the dependency is real but implicit: maybe a resource relies on a side effect (“this IAM policy must exist before that service can start”), or you’re passing IDs around as plain strings, so Terraform can’t infer the relationship.

That’s when people reach for the depends_on meta-argument. It works, but it can make plans more conservative and harder to predict. depends_on should be a last resort because it can lead to more unknowns (“known after apply”) and potentially broader changes than necessary, especially when used on modules.

Solution

Where possible, make Terraform infer the dependency naturally by wiring modules and resources together through real expression references (outputs to inputs), so the graph stays accurate without extra constraints. Save depends_on for truly “hidden” cases, and document why it exists so you don’t delete it in the future and reintroduce a flaky ordering bug.
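A sketch of both patterns, using a hidden bootstrap dependency as the documented exception (resource names and IDs are illustrative):

resource "aws_iam_role_policy" "app" {
  role   = aws_iam_role.app.name # implicit dependency: Terraform sees this reference
  policy = data.aws_iam_policy_document.app.json
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # hypothetical AMI ID
  instance_type = "t3.micro"

  # Hidden dependency: the app reads its role's policy at boot, which no
  # attribute reference can express, so document why this exists.
  depends_on = [aws_iam_role_policy.app]
}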

5. Provider versioning and upgrade surprises

Terraform providers evolve fast, and even minor releases can change defaults, add new behaviors, or start validating things more strictly.

If you don’t constrain and lock provider versions, different laptops and CI runners can install different provider builds, which shows up as confusing diffs (“Why does your plan look different than mine?”) or behavior changes after a seemingly harmless terraform init.

Version constraints alone aren’t enough for reproducibility. Terraform uses constraints to decide what’s allowed and then records the exact chosen versions (plus checksums) in .terraform.lock.hcl so future runs make the same selections by default. If that lock file isn’t committed and consistently used, you can still get “works on my machine” drift between environments.

Solution

Define provider requirements with a sensible constraint and commit .terraform.lock.hcl so everyone (including automation) installs the same provider versions. Terraform will then reuse the locked versions on subsequent init runs unless you explicitly upgrade.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

When you do want to upgrade providers, make a deliberate change (ideally its own PR): Run terraform init -upgrade, review the lockfile diff, and re-run terraform plan to catch breaking changes before they hit production.

For CI, a useful guardrail is running terraform init -lockfile=readonly so the pipeline fails if it would need to change the lock file — meaning upgrades can’t slip in accidentally.

6. Dealing with cloud API rate limits and eventual consistency

Sometimes your Terraform code is fine and the cloud just isn’t ready yet. Big applies can hit API throttling (429s / “Rate exceeded”) because Terraform is doing lots of create, read, and update calls at once — and most providers enforce per-account or per-region limits.

Furthermore, many services are eventually consistent: The API accepts a change, but other endpoints won’t “see” it for seconds or minutes.

The result is the classic failure at 90%: A resource was created, but the next read or update says “not found” or “access denied” until propagation finishes. Providers often implement retries and waiters for these realities, but defaults won’t always match your environment.

Solution

The fastest win is usually dialing down concurrency. Terraform lets you limit how many operations run at once with -parallelism (default 10), which can dramatically reduce throttling on busy accounts.

terraform apply -parallelism=4

For eventual consistency and long-running operations, add resource timeouts where supported, so Terraform waits long enough for propagation instead of failing early (the exact knobs vary by resource).
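For instance, many AWS resources accept a timeouts block (values here are illustrative, and the supported operations vary by resource):

resource "aws_db_instance" "app" {
  # ...

  timeouts {
    create = "60m" # allow slow initial provisioning
    update = "80m"
    delete = "60m"
  }
}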

Also watch for provider-specific retry settings and known propagation quirks. Some providers expose retry configuration or bake in backoff, but you may still need to tune around your account’s real limits.
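With the AWS provider, for example, retries can be tuned at the provider level (defaults differ across provider versions):

provider "aws" {
  region      = "eu-central-1"
  max_retries = 10         # cap retry attempts per API call
  retry_mode  = "adaptive" # back off based on observed throttling
}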

7. Managing multiple environments without chaos

The moment you go from “a sandbox” to dev, staging, and prod, Terraform stops being a single workflow and turns into an environment discipline problem. You need isolation (so a staging experiment can’t touch prod), but you also want consistency (so prod isn’t a snowflake).

This is where teams get tangled: a mix of ad hoc *.tfvars, half-shared state, copy-pasted folders, and “please remember to switch to the right workspace” muscle memory.

Terraform workspaces can help by separating state for multiple instances of the same configuration, but they’re not a security boundary. Terraform’s docs warn they aren’t appropriate for deployments that require separate credentials and access controls — which real multi-environment setups usually need.

Solution

Design for “it should be hard to do the wrong thing.” Many teams do that by making each environment an explicit unit (often a separate root module directory and separate remote state) so there’s no ambiguity about what an apply targets.
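One common repository layout makes that boundary explicit (directory names are illustrative):

environments/
  dev/
    main.tf    # calls shared modules with dev-sized inputs
    backend.tf # points at the dev state bucket and key
  staging/
  prod/
modules/
  network/
  app/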

Then layer in the non-negotiables: separate cloud accounts or projects where possible, distinct credentials and roles per environment, and a workflow that promotes changes forward (dev → staging → prod) instead of letting prod become the first test.

8. Managing Terraform modules at scale

When you have a handful of repos, modules feel like a convenience. At scale, modules become internal products that dozens of teams depend on — and that changes the rules.

Without clear ownership and a predictable release process, fragmentation creeps in: One team forks the “network module” to add a feature, another copy-pastes it to avoid waiting for reviews, and six months later you have five incompatible VPC implementations and no one is sure which one is “blessed.”

A common failure mode is interface drift. A module starts simple, then someone adds an input or changes a default. If consumers aren’t pinning versions (or they’re consuming from a moving Git branch), a routine init can pull a newer module revision and plans diverge across environments — or the change is breaking and causes failures everywhere.

Terraform’s docs call out that Git sources clone the default branch by default (so “HEAD changed” is a real risk), and that registry modules support explicit version constraints.

Solution

Define a module contract and stick to it: stable inputs and outputs, backward-compatible changes by default, and semantic versioning with release notes so consumers can upgrade intentionally.

In day-to-day usage, that usually means pinning your module source to a version (for registry modules) or to an immutable ref (for Git modules), instead of trusting whatever happens to be on the default branch. If you are sourcing modules from Git, make the pin explicit by using ref so you are not tracking the repository default branch by accident.
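In practice, that looks like one of these two patterns (the module sources here are hypothetical):

module "network" {
  source  = "acme/network/aws" # registry module
  version = "~> 2.1"           # any 2.x release from 2.1 up, never 3.0
}

module "dns" {
  # Git module pinned to an immutable tag, not the default branch
  source = "git::https://github.com/acme/terraform-dns.git?ref=v1.4.2"
}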

Read more: 10 Best Practices for Managing Terraform Modules at Scale

9. Refactoring without accidental destroy/recreate

Terraform doesn’t identify resources by “what they are” so much as by where they live in the configuration (their resource address).

So a refactor that feels harmless — renaming a block, moving something into a module, splitting a module, switching to for_each — can look to Terraform like “old address disappeared, new address appeared,” which often translates into delete and create. For stateful resources (databases, buckets, clusters), that’s the kind of surprise you only need once.

The fix is telling Terraform, explicitly, “This is the same real object, just with a new address.” That’s what moved blocks are for: They let you declare the mapping in code so Terraform can update state as part of the plan/apply, instead of trying to recreate infrastructure.

# Before: resource "aws_s3_bucket" "logs" { ... }
# After:  resource "aws_s3_bucket" "audit_logs" { ... }

moved {
  from = aws_s3_bucket.logs
  to   = aws_s3_bucket.audit_logs
}

Solution

Favor moved blocks for refactors because they’re reviewable, repeatable, and travel with the code. If you truly need an out-of-band state change (or you’re refactoring older code and want a one-off), terraform state mv can rebind addresses in state, but it’s easier to misuse — so validate the resulting plan carefully.
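The shape of the command, with illustrative addresses:

terraform state mv 'aws_s3_bucket.logs' 'module.logging.aws_s3_bucket.logs'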

In practice, the safe workflow is as follows: Commit the refactor and moved blocks together, run a plan, and make sure you see moves instead of destroys.

If you’re running through Spacelift, this is a good moment to lean on approval gates or policies that flag unexpected deletes during a refactor, so “it wants to recreate prod” doesn’t make it past the plan stage.

10. Importing existing (brownfield) infrastructure

Bringing already-running infrastructure under Terraform is rarely a clean cutover. The classic terraform import workflow can bind a real resource to Terraform state, but it doesn’t magically produce the configuration that matches what’s out there — and it’s inherently fiddly at scale (IDs to look up, addresses to map, and lots of “Why does the next plan want to change everything?”).

Another practical limiter: terraform import imports one resource at a time, not an entire collection like “a whole VPC.”

A typical brownfield surprise is importing something successfully and then immediately seeing a huge plan because your module defaults don’t match reality (tags, encryption flags, timeouts, nested blocks, provider defaults).

Importing gets you state alignment, but you still have to do the harder job: making your configuration reflect what you intend to keep versus what you want Terraform to change.

Solution

For bigger migrations, lean on config-driven import (import blocks) so imports become part of the normal plan/apply workflow instead of ad hoc state surgery. You can even write only the import blocks first and use terraform plan -generate-config-out=... to generate starter configuration, then prune it down to the minimal interface you actually want to support.

import {
  to = aws_s3_bucket.logs
  id = "acme-prod-logs"
}
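With import blocks committed, you can ask Terraform to draft starter configuration for imported resources that don’t have any yet (the output filename is your choice):

terraform plan -generate-config-out=generated.tf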

The “scale” trick is to import incrementally: one service boundary or module at a time, run a plan, and only then move on. That way you validate convergence as you go, instead of discovering 800 diffs at the end.

11. Performance bottlenecks in large plans and applies

Terraform performance can fall off a cliff as your resource count grows.

A slow plan is usually slow for a boring reason: Terraform loads a large state file, builds the dependency graph, and then makes a lot of provider API reads to refresh reality before it computes a safe diff. As configuration and state get bigger, that refresh work turns plan/apply into a real productivity tax.

You can also feel extra latency with remote state backends or remote execution because you’ve added more moving parts: State reads and writes over the network, runner queue time, and sometimes longer “distance” between the runner and the cloud APIs. None of that is wrong — it just makes big runs feel even bigger.

Solution

The most reliable way to speed up Terraform is to reduce how much Terraform needs to evaluate for any single change. Instead of one mega-root module for an entire account, split into smaller stacks (by service, domain, or lifecycle) so day-to-day changes touch a smaller state, refresh fewer resources, and produce plans humans can actually review.

Then tune execution so you don’t overwhelm APIs. Terraform’s default parallelism can be too aggressive in busy accounts, and throttling plus retries can make runs slower and flakier than they need to be. Dialing parallelism to match real provider limits often makes runs more predictable.

12. Making changes safe: review, testing, and policy guardrails

At some point, the biggest risk isn’t “Terraform is wrong.” It’s that humans can’t reliably review what Terraform is saying. A plan with hundreds (or thousands) of changes is easy to rubber-stamp — and it’s hard to spot the one destructive action hiding in the noise.

Correctness also isn’t just syntax. A configuration can be valid and still violate your organization’s rules (“no public S3,” “only these regions,” “no wide-open security groups”), or break module expectations in subtle ways.

Solution

Treat the plan as an artifact machines can understand, not just a wall of text. Terraform can output JSON representations of plans and state via terraform show -json, which makes it much easier to build automated checks that answer questions like “Is anything being destroyed?” or “Are any resources becoming public?”
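As a minimal sketch, a CI step like this fails when the saved plan would delete anything (the artifact name is illustrative):

terraform show -json tfplan | jq -e '
  [.resource_changes[] | select(.change.actions | index("delete"))] | length == 0
'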

For module and root-module confidence, terraform test gives you a first-party way to run test files and validate expected behavior (and it can emit JUnit XML for CI reporting).
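A minimal sketch of a test file you could run with terraform test (the file name and assertion details are illustrative):

# tests/bucket.tftest.hcl
run "bucket_name_is_prefixed" {
  command = plan

  assert {
    condition     = startswith(aws_s3_bucket.logs.bucket, "acme-")
    error_message = "Log buckets must use the acme- prefix."
  }
}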

For guardrails that catch mistakes early, custom conditions (preconditions/postconditions and variable validations) let you fail fast with a clear message when a change violates assumptions.
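For example, a variable validation that rejects bad input at plan time with a readable error:

variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}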

Finally, policy as code is where this clicks: evaluate the JSON plan against rules (often written in Rego via Open Policy Agent) and block risky changes automatically. In Spacelift, policies can be applied at decision points in the workflow using plan JSON as input — so you’re not relying on heroic human review to catch the scary stuff.

13. Licensing and governance uncertainty

For a lot of teams, “Terraform risk” isn’t technical — it’s licensing and governance. Terraform’s license changed to Business Source License 1.1 in August 2023, which created uncertainty for anyone redistributing Terraform, embedding it in products, or offering IaC as a hosted service.

Many organizations can keep using Terraform internally, but the gray area is usually “Are we building something that could be considered competitive?” That question tends to trigger legal review and slow platform roadmaps.

Governance adds a second layer: when a single vendor controls the roadmap, release cadence, and contribution process, teams need to plan for the possibility of future shifts (license terms, deprecations, feature direction) that ripple through their infrastructure workflow.

Solution

Make licensing a repeatable checklist, not a one-time fire drill: Document how Terraform is used (internal only vs. redistributed), who consumes it (engineers vs. customers), and where artifacts land (plans, state, modules).

If there’s any chance you’re packaging or hosting Terraform as part of a commercial offering, get an explicit legal read using the vendor’s licensing guidance instead of relying on assumptions.

Then reduce lock-in by designing for portability: keep module interfaces clean, avoid coupling workflows to one specific hosted backend, and pin tool and provider versions so governance changes can’t sneak in via accidental upgrades.

Finally, it’s worth having a realistic alternative on the table. OpenTofu was created in response to the license change, is MPL-2.0 licensed, and is now a CNCF Sandbox project.

If you decide to hedge (or migrate), Spacelift supports both Terraform and OpenTofu — so you can keep the workflow consistent while you reduce licensing uncertainty.

How can Spacelift help solve your infrastructure challenges?

Terraform is powerful, but at scale you usually need an orchestration layer around it: something that standardizes runs, handles state safely, enforces policies, and gives you an audit trail that doesn’t depend on who ran what from where.

Spacelift helps teams operationalize Terraform and OpenTofu (and other IaC tools) by providing a purpose-built workflow for planning, applying, approvals, and governance.

It also supports a two-path deployment model: a rigorous IaC and GitOps path for production workflows, plus a faster Intent path for experiments and non-critical requests, both under the same guardrails and visibility layer.

With Spacelift you get:

  • Policies (based on Open Policy Agent) – You can control how many approvals you need for runs, which resources you can create and with what parameters, and how runs behave when a pull request is opened or merged.
  • Multi-IaC workflows – Combine Terraform with Kubernetes, Ansible, and other infrastructure-as-code (IaC) tools such as OpenTofu, Pulumi, and CloudFormation; create dependencies among them; and share outputs.
  • Build self-service infrastructure – Use Blueprints so teams can provision infrastructure based on Terraform and other supported tools simply by completing a form.
  • Integrations with third-party tools – You can integrate with your favorite third-party tools and even build policies for them. For example, see how to integrate security tools in your workflows using Custom Inputs.

Spacelift also supports private workers, so you can execute workflows inside your own environment while keeping access scoped and auditable. Read the documentation for more information on configuring private workers.

You can check it out for free by creating a trial account or booking a demo with one of our engineers.

Key points

Terraform scales best when you treat it less like a CLI and more like a platform: reliable state, clear environment boundaries, disciplined module ownership, and guardrails that catch risky changes before they ship.

The good news is that most Terraform pain is predictable — secrets creeping into artifacts, drift, provider surprises, slow runs, and risky refactors all have patterns and proven fixes.

If you take one thing from these challenges, make it this: Optimize for repeatability and safety, not heroics. Tighten the workflow, automate the checks, and keep stacks and ownership models simple enough that the next person can reason about them.

And if you want fewer moving parts to hand-roll, tools like Spacelift can help standardize runs, enforce policies, and keep state, approvals, and audit trails in one place — so scaling Terraform feels like adding capacity, not adding chaos.

Manage Terraform better and faster

If you are struggling with Terraform automation and management, check out Spacelift. It helps you manage Terraform state and build more complex workflows, and it adds several must-have capabilities for end-to-end infrastructure management.

Learn more

Frequently asked questions

  • What is the biggest challenge of Terraform?

    The biggest Terraform challenge is operational safety at scale: keeping state, secrets, environments, and workflows consistent so changes are predictable instead of surprising.

  • Why are people moving away from Terraform?

    Many teams are reassessing Terraform because the relicensing to the Business Source License weakened open-source guarantees and increased governance and redistribution risk. That uncertainty helped drive adoption of OpenTofu under a more open governance model.

  • Will AI replace Terraform?

    AI won’t replace Terraform soon. In practice, it will help generate HCL, suggest module patterns, and catch policy or drift issues earlier. Terraform remains the execution and state-management layer for predictable, auditable infrastructure changes, while AI improves authoring and review.
