When managing servers at scale, keep in mind that change will be constant and will most likely happen outside your control.
For instance, a developer fixes a production issue by manually editing a config file, a script updates package versions outside your playbooks, or someone tweaks settings through the AWS console. Any of this can happen, and before long, your servers don't match what your Ansible playbooks say they should be.
This state is referred to as configuration drift, and it breaks the promise of automation. When this happens, your playbooks stop being reliable, deployments fail for reasons you didn’t expect, and troubleshooting becomes harder because you’re not sure if servers actually match what’s documented.
That's the focus of this article. We'll walk through configuration drift in Ansible: what it is, why it happens, how to spot it, and how to deal with it.
What does configuration drift mean in Ansible?
Configuration drift happens when your managed systems don’t match their desired state in your Ansible playbooks. It’s the difference between what you think your infrastructure should be and what’s actually there.
When you run an Ansible playbook, you set a target state for your infrastructure, such as package versions, configuration file contents, and so on. Ansible pushes these states out to your managed nodes.
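To make that concrete, here is a minimal sketch of what a desired-state task looks like (standard Ansible modules; the nginx example is illustrative):

# Desired state: nginx installed, running, and enabled.
# Ansible converges each managed node toward this on every run.
- name: Ensure nginx is installed
  ansible.builtin.package:
    name: nginx
    state: present

- name: Ensure nginx is running and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true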
Drift happens when something changes those systems outside of Ansible. Sometimes the change is intentional, like when you apply a hotfix directly to production. However, whatever the reason might be, you just end up in a state where your servers don’t match what your code says anymore.
Now, this is a problem because Ansible’s idempotent design expects to know the current state. When drift exists, playbooks might fail, skip necessary changes, or produce unexpected results. The infrastructure you think you have isn’t the infrastructure you actually have.
Unlike infrastructure as code tools like Terraform that maintain state files, Ansible doesn’t track what it has previously deployed. Rather, it checks the current state of systems when it runs and applies changes to match the desired state. This works great when systems only change through Ansible, but breaks down when drift occurs.
Why does configuration drift happen in Ansible-managed environments?
Drift has several common causes in Ansible environments, and understanding them helps you prevent it.
- Manual changes during incidents — When something breaks in production at 2 am, engineers SSH in and fix it directly (change a config file, restart a service, update a package, etc.). The service comes back, but the change often never makes it into the Ansible playbooks. Later, the fix sits there until someone updates the playbooks, which may or may not happen. Then the next time Ansible runs, it can either break or wipe out the manual fix.
- Multiple automation tools — Most teams use several tools together: Terraform to build infrastructure, Ansible to configure systems, scripts for backups, monitoring tools that adjust capacity, and Kubernetes for containers. Problems show up when tools overlap or touch the same resources. Terraform might manage EC2 tags while Ansible also tries setting them, or backup scripts might modify files that Ansible templates control. When “who owns what” isn’t clear, management gets confusing fast.
- Package updates and system changes — Operating systems can update packages automatically, applying security patches and shifting dependencies outside Ansible's workflow. That can break assumptions in your playbooks. For example, your playbook expects Nginx 1.18, but automatic updates install 1.20 and the config syntax changes, so your template fails. Or a security update modifies a config file that your playbook overwrites on the next run.
- Cloud provider changes — Cloud platforms change resources through their own automation. Auto-scaling groups add and remove instances, load balancers update health checks, security groups get modified in the console, and cost optimization tools right-size instances based on usage patterns. Those changes don’t flow through Ansible, so your playbooks may describe infrastructure that no longer exists, or that behaves differently than expected.
- Team collaboration challenges — Multiple teams often manage the same infrastructure (platform sets base configuration, security applies hardening, application teams deploy services), and each maintains their own playbooks. Without coordination, playbooks conflict: one team’s changes overwrite another’s, dependencies aren’t always clear, and the last team to run their playbook effectively “wins,” reverting everyone else’s changes.
How to detect Ansible drift
Finding drift requires comparing what should exist against what actually exists. Ansible provides several approaches.
1. Run playbooks in check mode
Check mode runs your playbook without making changes. It shows what would change if you ran it normally. This reveals drift because any reported changes indicate the system doesn’t match your desired state.
ansible-playbook site.yml --check --diff
The --check flag simulates execution, while the --diff flag shows file differences line by line. If check mode reports changes on infrastructure you haven't intentionally modified, you've found drift.
Check mode has limitations. Tasks that use command or shell modules can’t determine if changes are needed without actually running. Tasks that register variables or use previous task results might fail. Complex playbooks with conditional logic often break in check mode.
Using this effectively means structuring your playbooks to be check-mode compatible. Use built-in modules instead of shell commands when possible. Design tasks to work without state from previous tasks.
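For instance, here's one pattern for keeping a necessary command safe in check mode (a sketch; the nginx command is illustrative):

# A read-only command that later tasks depend on: force it to run even in
# check mode, and mark it as never "changed" so it doesn't report as drift
- name: Validate the nginx configuration
  ansible.builtin.command: nginx -t
  check_mode: false
  changed_when: false
  register: nginx_syntax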
2. Compare system facts against baselines
Ansible automatically collects facts (system information such as package versions, running services, file checksums, and OS details) for each managed node. You can gather these facts and compare them against known good baselines to detect drift.
- name: Gather system facts
  hosts: webservers
  tasks:
    - name: Collect current state
      setup:
        gather_subset:
          - all
      register: current_facts

    - name: Compare against baseline
      assert:
        that:
          - ansible_distribution_version == baseline_os_version
          - ansible_kernel == baseline_kernel_version
        fail_msg: "System facts don't match baseline"
This approach works well for slowly changing configurations like OS versions and installed packages. It's less effective for rapidly changing states, such as temporary files or dynamic configurations. You'll need to establish your baseline by capturing facts from a correctly configured reference system, then use that as your comparison point.
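One way to capture such a baseline, sketched below, is to record facts from a known-good host into a vars file on the control node (the reference hostname and file path are assumptions):

- name: Capture baseline facts from a reference host
  hosts: reference-webserver    # assumption: a correctly configured host
  tasks:
    - name: Write selected facts to a baseline vars file on the control node
      ansible.builtin.copy:
        content: |
          baseline_os_version: "{{ ansible_distribution_version }}"
          baseline_kernel_version: "{{ ansible_kernel }}"
        dest: ./group_vars/webservers/baseline.yml
      delegate_to: localhost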
3. Scheduled verification runs
Run your playbooks on a regular schedule purely for verification. These scheduled runs in check mode act as continuous drift detection. Set them up to run hourly or daily, depending on how quickly you need to catch drift.
# Add this cron job to your control node's crontab
0 */6 * * * ansible-playbook /etc/ansible/site.yml --check --diff > /var/log/ansible-drift.log 2>&1
Parse the output to identify changes and send alerts when drift is detected. This gives you visibility into when and where drift occurs, even if you don't automatically remediate it.
The challenge is handling false positives. Some tasks always report changes even when nothing meaningful has changed. You’ll need to tune your playbooks to minimize noise.
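One common fix, sketched here, is to mark genuinely read-only tasks so they never count as a change (the command is illustrative):

# Without changed_when: false, this command would report "changed" on every
# run and show up as false-positive drift in scheduled check-mode reports
- name: Record the current kernel for the drift report
  ansible.builtin.command: uname -r
  register: current_kernel
  changed_when: false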
4. File integrity monitoring
For critical configuration files, use Ansible to monitor checksums. Store expected checksums (captured from your known-good configuration) in variables and compare them against the current ones. This catches unauthorized file modifications.
- name: Check nginx config integrity
  stat:
    path: /etc/nginx/nginx.conf
    checksum_algorithm: sha256
  register: nginx_config

- name: Alert on config drift
  fail:
    msg: "Nginx config has been modified outside Ansible"
  when: nginx_config.stat.checksum != expected_checksum  # Define expected_checksum in your variables
This works for static configuration files, but doesn't help with dynamic resources or settings that change legitimately.
5. Use Ansible Tower and AWX built-in features
If you use Ansible Automation Platform (formerly Ansible Tower) or AWX (the open-source upstream project), you get centralized drift detection features that let you schedule playbook runs in check mode across your inventory, view results in a dashboard, and track which systems drift most frequently.
Both platforms provide job templates that can run in check mode on a schedule, log the results, and trigger notifications when drift is detected. This gives you enterprise-grade drift detection without building it yourself.
IaC configuration languages, combined with robust IaC orchestration, collaboration, and automation tools such as Spacelift, can effectively detect, prevent, and manage drift. IaC tooling also enables automated provisioning and updates, reducing the risk of manual errors or inconsistencies that cause drift.
Spacelift comes with a built-in mechanism to detect and, optionally, reconcile drift. It works by periodically executing proposed runs on your stable infrastructure (in Spacelift, generally represented by the FINISHED stack state) and checking for any changes.
Remediation process for Ansible drift
Once you detect drift, you need to decide what to do about it. Remediation takes different forms depending on whether the drift was intentional or accidental.
1. Evaluate the drift
Not all drift is bad. Sometimes, the changes made outside Ansible are improvements that should be kept. Other times, they’re mistakes that should be reverted.
Check what changed and why. Look at the diff output from check mode. Identify which resources drifted. Talk to the teams that manage those systems. Determine if the changes were intentional and necessary.
If the changes represent improvements, update your playbooks to include them, and if they were mistakes or unauthorized changes, revert them by running your playbooks normally.
2. Update playbooks to match desired changes
When drift represents intentional improvements, capture them in your playbooks. This makes the changes permanent and repeatable.
Say someone manually tuned an application's memory settings for better performance. Those settings work well and should be preserved, so update the playbook to include the new values.
# Update playbook to reflect production improvements
- name: Configure application memory
  lineinfile:
    path: /etc/app/config.ini
    regexp: '^memory_limit'
    line: 'memory_limit=2048M'  # Updated based on production tuning
Document why you made the change and link to incident reports or performance data. This creates an audit trail and helps future maintainers understand the decision.
3. Revert unauthorized changes
For drift that shouldn’t exist, run your playbooks normally to restore the desired state. Ansible will detect the differences and make changes to match what the playbooks define.
ansible-playbook site.yml --limit webservers
This automatically reverts the drift. Systems return to their defined state. The --limit flag lets you target specific hosts without affecting your entire infrastructure.
Before reverting, make sure you understand the impact. The drift might have been someone’s workaround for a problem. Reverting it could break something or bring back the original issue. Test in non-production first when possible.
4. Implement automatic remediation
For environments where maintaining the exact state is critical, set up automatic drift remediation. Schedule your playbooks to run regularly in normal mode, not just check mode.
# Add remediation cron job to your control node
0 */4 * * * ansible-playbook /etc/ansible/site.yml --limit production > /var/log/ansible-remediation.log 2>&1
This continuously enforces your desired state. Any drift gets automatically corrected within your scheduled interval. It works best for infrastructure where changes should only come through Ansible.
The risk is that automatic remediation might revert legitimate emergency fixes or conflict with other automation. Use this carefully and monitor the results.
5. Track and analyze drift patterns
Log every time you detect or fix drift. Store the data centrally where you can review it later. You can use ELK stack, Splunk, or even a simple database. Look for patterns that point to bigger problems.
Are the same files drifting repeatedly? You probably have conflicting automation or unclear ownership. Does drift appear right after deployments? Then your deployment process isn't using Ansible properly. Does one team cause more drift than the others? Then they need better training or tools.
Use these patterns to prevent drift, not just fix it over and over.
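A minimal way to start collecting that data, assuming you append events to a file on the control node before shipping it to ELK or Splunk (the path and message format are assumptions):

- name: Append a drift event to a central log on the control node
  ansible.builtin.lineinfile:
    path: /var/log/ansible/drift-events.log
    line: "{{ ansible_date_time.iso8601 }} {{ inventory_hostname }} drift_detected"
    create: true
  delegate_to: localhost
  when: drift_check.changed    # assumes a check-mode result registered earlier
  # note: ansible_date_time requires fact gathering to be enabled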
Best practices for managing configuration drift in Ansible
Preventing drift is better than detecting and fixing it. These practices reduce how often drift occurs.
Enforce Ansible as the single source of truth
Make a rule that configuration changes must be applied through version-controlled Ansible playbooks — not direct SSH, manual file edits, or ad hoc scripts.
To make that stick, teams need a fast, approved path. Set up self-service playbook execution through a governed runner (for example, Spacelift or Ansible Automation Platform/AWX), with role-based access control, approvals, and a clear audit trail.
Write the policy down. Explain what it prevents (drift, inconsistent changes, and untracked access). Provide playbooks for common tasks, and treat exceptions as time-bound and explicitly approved, not the default.
Version control everything
Store all playbooks, roles, variables, and inventory files in version control. This gives you a complete history of infrastructure changes. You can see what changed, when, and why.
Use pull requests for changes. Require reviews before merging. Run automated tests against playbooks before deploying them. This catches errors before they affect production and creates accountability for changes.
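Those automated tests can be as simple as a lint plus syntax check on every pull request. Here's a hypothetical sketch in GitHub Actions syntax (the workflow name and paths are assumptions):

# .github/workflows/ansible-ci.yml
name: ansible-ci
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Ansible and linter
        run: pip install ansible ansible-lint
      - name: Lint playbooks
        run: ansible-lint playbooks/
      - name: Syntax-check the main playbook
        run: ansible-playbook playbooks/site.yml --syntax-check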
Version control also makes it easy to roll back bad changes. When drift happens because someone deployed a bad playbook, revert to the previous version and redeploy.
Build playbooks with idempotence in mind
Ansible modules are designed to be idempotent. Running the same playbook multiple times produces the same result. The system ends up in the desired state regardless of its starting state.
This only works if you use modules correctly. Avoid shell and command modules when built-in modules exist. They can’t check the current state properly, which breaks idempotence.
# Avoid this
- name: Start nginx
  command: systemctl start nginx

# Do this instead
- name: Start nginx
  service:
    name: nginx
    state: started
Test your playbooks by running them multiple times. They should complete successfully each time without making unnecessary changes. If they report changes on every run, they're not truly idempotent.
Use dynamic inventory when possible
Static inventory files become outdated as infrastructure changes. Instances get added and removed. IP addresses change. Tags get updated.
Dynamic inventory queries your infrastructure in real-time. It always reflects the current state. This prevents drift between your inventory and reality.
# Use AWS dynamic inventory
plugin: aws_ec2
regions:
  - us-east-1
keyed_groups:
  - key: tags.Environment
    prefix: env
  - key: tags.Role
    prefix: role
Dynamic inventory works with all major cloud providers and virtualization platforms. It eliminates a common source of drift.
Implement comprehensive testing
Test playbooks before running them in production. Set up test environments that mirror production and run playbooks there first to catch issues.
Use Molecule for role testing. Molecule is a testing framework built for Ansible roles. It spins up test instances, applies your roles, and verifies the results. This catches problems before they affect real infrastructure.
# molecule.yml
scenario:
  name: default
  test_sequence:
    - destroy
    - create
    - converge
    - idempotence
    - verify
    - destroy
Testing idempotence helps catch drift. Run a playbook, then run it again. If something changes the second time, you've got a problem.
Document emergency procedures
Emergencies happen. Sometimes you need to skip Ansible and fix things fast. Write down how to handle this properly.
Create a process for emergency changes that requires documentation in incident reports, sets up automated reminders to update playbooks after incidents, and makes it someone’s job to reconcile emergency changes back into Ansible.
# Emergency change template
Incident: [INC-12345]
System: [webserver-prod-01]
Change made: [Updated nginx worker_processes from 4 to 8]
Reason: [High CPU during traffic spike]
Ansible update required: [Yes]
Playbook updated: [Pending]
This ensures emergency fixes don't become permanent drift.
Separate concerns clearly
Split ownership cleanly across teams:
- Platform team: owns the operating system baseline and base packages.
- Security: owns hardening standards and compliance controls.
- Application teams: own service deployment and day-to-day changes.
Mirror that split in automation. Keep separate playbooks (or roles) for each domain, and enforce role-based access control in Ansible Automation Platform or AWX and in Spacelift so teams can run what they own, and nothing else.
# Clear ownership structure
playbooks/
  platform/
    base-os.yml       # Platform team
  security/
    hardening.yml     # Security team
  applications/
    webapp.yml        # Application team
Run these playbooks through a governed workflow: version-controlled changes, approvals where needed, and a clear audit trail of who ran what, when, and against which environments. That gives teams faster self-service without creating configuration conflicts or bypassing controls.
Monitor and alert on drift
Don’t wait to discover drift accidentally. Instead, set up monitoring that actively looks for it by running regular drift detection checks and alerting when drift is found.
Integrate drift detection with your monitoring stack so you can send alerts to appropriate teams and track drift metrics over time.
- name: Check for drift
  hosts: all
  tasks:
    # Run a representative desired-state task forced into check mode;
    # "changed" means the host no longer matches your definition.
    # (Registering the result of include_tasks doesn't capture the included
    # tasks' outcomes, so verify a concrete task instead. The template path
    # is illustrative.)
    - name: Verify nginx config matches the template
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      check_mode: yes
      register: drift_check

    - name: Send alert if drift detected
      uri:
        url: "https://monitoring.example.com/alert"
        method: POST
        body_format: json
        body: "{{ drift_check }}"
      when: drift_check.changed
This shows you drift as it happens, rather than finding it during an incident.
Understanding and reconciling drift with Spacelift
Spacelift adds drift detection and remediation to your Ansible workflows. Manage drift across your infrastructure from one place.
Scheduled drift detection
Spacelift runs scheduled drift detection checks on your Ansible stacks, letting you configure how often checks run, set policies for which stacks get monitored, and view drift detection results in a unified dashboard.
The platform executes your playbooks in check mode on your defined schedule. It captures the output and presents it in an easy-to-understand format. You’ll see what drifted, the differences, and when it happened.
You don’t need to build drift detection yourself. Spacelift handles the scheduling, runs your playbooks, keeps logs, and alerts you.
Fixing drift automatically
Spacelift can fix drift on its own when it spots it. Set it to run your playbooks automatically and put things back how they should be, or make it wait for you to approve the fix first.
Set up policies that control remediation behavior. Some drift might require immediate correction, while other changes need review before action. Spacelift lets you handle each case appropriately.
The platform respects your existing Ansible code: no special modifications to your playbooks are needed, and it works with whatever Ansible content you've got.
Using multiple tools together
Most environments run Ansible with other tools. Terraform provisions infrastructure, Ansible configures it, and Kubernetes manages containers.
Spacelift connects these by making one stack depend on another. For instance, it can pass Terraform outputs into Ansible, letting you build workflows that span multiple tools.
This stops drift from tools stepping on each other. Spacelift keeps them coordinated with proper handoffs and shared information.
Visibility and compliance
Spacelift shows you everything happening in your Ansible runs: what playbooks ran, what changed, who did it, and so on. With this visibility, you'll spot drift patterns as they develop.
This increased visibility, in turn, makes compliance easier. Need to show auditors something? Pull up how your infrastructure matches policies. Show them your change tracking. Prove you catch drift and fix it.
Hook it up to your version control, CI/CD, and monitoring. Drift detection just becomes part of managing infrastructure.
Learn more about using Spacelift with Ansible. If you want to take your infrastructure to the next level, create a Spacelift account today or book a demo with one of our engineers.
Key points
Configuration drift occurs when your servers don’t match the configurations defined in your Ansible playbooks. This breaks automation reliability and makes infrastructure harder to manage.
Drift happens for several reasons, including manual changes during incidents, multiple automation tools with overlapping responsibilities, automatic package updates, cloud provider changes, and poor coordination between teams.
To catch drift, run Ansible in check mode, compare facts against your baseline, schedule verification playbooks, monitor file integrity, or use a centralized platform like Ansible Tower or Spacelift.
When you find drift, determine whether the changes are worth keeping. If they are, update your playbooks. If not, run your playbooks to undo them. You can also automate the fixes in some cases.
Stop drift before it happens: make Ansible your only source of truth, keep all changes in version control, write idempotent playbooks, use dynamic inventory, test your changes, document what to do in emergencies, keep responsibilities separate, and watch for drift actively.
Spacelift automates drift detection and remediation for Ansible while providing visibility, compliance tracking, and orchestration across multiple infrastructure tools.
Manage Ansible better with Spacelift
Managing large-scale playbook execution is hard. Spacelift enables you to automate Ansible playbook execution with visibility and control over resources, and seamlessly link provisioning and configuration workflows.
Frequently asked questions
How do you set up a source of truth for Ansible-managed infra?
Use a Git repo as the source of truth, store Ansible playbooks, roles, inventories, and vars there, then enforce changes via pull requests, code review, and CI runs that execute linting and check mode. For dynamic environments, generate inventory from a CMDB or cloud API and treat it as code too, then have your pipeline render it deterministically and apply changes only from the main branch.
How often should Ansible drift detection run?
Drift detection with Ansible should typically run at regular intervals aligned with your infrastructure’s change frequency, often every 15–60 minutes for dynamic environments. For more static systems, daily or even weekly checks may be sufficient to balance accuracy with resource usage.
Can Ansible prevent drift or only detect and fix it?
Ansible doesn't prevent drift in real time, but it can detect and correct it during playbook runs. Because it uses a declarative approach with idempotent tasks, reapplying the playbook enforces the desired state.
