How to Manage Infrastructure as Code at Scale (Examples)

Scalability and IaC are not always a match made in heaven. Several areas of the development environment and its associated tooling need serious consideration if IaC is to be managed effectively at scale.

This blog post touches on a few critical areas that can negatively impact IaC when provisioning and deploying complex, large-scale infrastructures. It also shows how to manage IaC at scale without relying on workarounds or on a manual or partially automated development process.

Limitations of Infrastructure as Code Platforms

The first area to understand is IaC platforms. Terraform is the de facto standard and by far the most widely used IaC platform, although several alternatives, such as Pulumi, CloudFormation, and Ansible, are available. Each has its relative strengths and weaknesses.

While we are huge fans of Terraform, we are also pragmatic about its shortcomings, which limit a developer’s ability to manage IaC cleanly at scale. Below, we highlight some of the weaknesses to consider:

  • Resource management: Running the terraform state mv command every time a resource identifier changes is painful and time-consuming. A large codebase complicates this further, as it is challenging to keep the code clean as it evolves over time.
  • TF State: Terraform records the current state of managed resources in a state file, which plan and apply read and update. State is very fragile and can be impacted, for example, by a stale state file or by forgetting to configure remote state (the “terraform remote config” command in older Terraform versions, since replaced by backend configuration and “terraform init”).
  • Local machines: Each action is manual and must be performed directly by a developer, which causes problems when multiple people work on the same codebase. Developer A applies their changes to the environment and opens a pull request. At the same time, developer B would like to deploy their changes but is blocked because their copy of the codebase does not yet include developer A’s changes. Applying in this state would damage the resources created in the previous deployment. In larger teams, this often results in time wasted waiting and rebasing the codebase before changes can be applied.
  • Security: Developers working with the codebase need access to the Terraform state, which is a security concern because of how Terraform state works. Sensitive data pulled from an external system such as Vault is stored in the state file in plaintext once it is used in a resource, so a threat actor with access to the state could run `terraform state pull` to read all the secrets.
  • CLI: Typos and other types of errors are common, resulting in misconfigurations or failed plans and applies.
  • Infrastructure Fine-Tuning: It is difficult to interface with subsets of the infrastructure to make small changes dynamically. Terraform works well for bulk deployments but lacks the delicate touch necessary for fine-tuning infrastructures.
  • Rollback: Terraform does not support rolling back to a previous state because of how plan works: it always computes the changes needed to move from the current state to the configuration, which means that the only option is to roll forward.
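
To make the security point concrete, here is a minimal Python sketch that scans a state document for attributes that look like secrets. It assumes the JSON layout used by current Terraform state files (a top-level `resources` list whose `instances` carry an `attributes` map). The sample state, the hint list, and the function name `find_plaintext_secrets` are hypothetical and exist only for illustration.

```python
import json

# Substrings that suggest an attribute may hold a secret (illustrative, not exhaustive).
SENSITIVE_HINTS = ("password", "secret", "token", "private_key")


def find_plaintext_secrets(state: dict) -> list[str]:
    """Return 'type.name.attribute' paths whose names look sensitive and hold a non-empty string."""
    findings = []
    for resource in state.get("resources", []):
        for instance in resource.get("instances", []):
            for key, value in instance.get("attributes", {}).items():
                looks_sensitive = any(hint in key.lower() for hint in SENSITIVE_HINTS)
                if looks_sensitive and isinstance(value, str) and value:
                    findings.append(f"{resource['type']}.{resource['name']}.{key}")
    return findings


# Hypothetical, hand-written state fragment; not taken from a real deployment.
SAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_db_instance",
      "name": "ledger",
      "instances": [
        {"attributes": {"identifier": "ledger-db", "password": "hunter2"}}
      ]
    }
  ]
}
""")

print(find_plaintext_secrets(SAMPLE_STATE))
```

Anyone who can read the state can run a scan like this, which is why access to state deserves the same care as access to the secrets themselves.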

New to IaC? See: What Is Infrastructure as Code? Examples, Best Practices & Tools

Benefits of Using CI/CD Processes and Workflows to Manage IaC

Can CI/CD help with managing IaC? The short answer is yes! IaC development can be significantly enhanced by CI/CD processes and workflows, just as application or tool development can. Using CI/CD to provide workflows for infrastructure provisioning and deployment offers the following benefits:

Consistency:

  • Code is controlled – well documented and optimized.
  • Code is modular – reused across multiple projects.

Testing:

  • Testing is integrated – cannot be skipped.
  • It is far simpler to test code in small bites – issues are found faster in 100 lines of code than when hunting for a needle in a 10K-line haystack.

Automation:

  • Routing of change approvals – single or multiple sign-offs.
  • Establishes a set of controllable mechanisms – gates for events, etc.
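
As a rough illustration of integrated testing and approval routing, the sketch below gates the apply step on two conditions: tests must have passed, and enough independent sign-offs must exist. The `ChangeRequest` model, the `can_apply` function, and the default of two approvals are assumptions made for this example, not features of any particular CI/CD product.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A proposed infrastructure change moving through the pipeline (hypothetical model)."""
    author: str
    tests_passed: bool
    approvers: set[str] = field(default_factory=set)


def can_apply(change: ChangeRequest, required_approvals: int = 2) -> tuple[bool, str]:
    """Gate the apply step: tests are mandatory and authors cannot approve their own change."""
    if not change.tests_passed:
        return False, "tests failed or were not run"
    independent = change.approvers - {change.author}
    if len(independent) < required_approvals:
        return False, f"needs {required_approvals} approvals, has {len(independent)}"
    return True, "ok"


print(can_apply(ChangeRequest("alice", True, {"alice", "bob"})))
print(can_apply(ChangeRequest("alice", True, {"bob", "carol"})))
```

Because the gate is code rather than convention, skipping a test or self-approving a change is rejected the same way every time.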

See how to deploy your infrastructure in CI/CD here: How to Deploy Your Infrastructure in CI/CD Using Terraform

A Complex Infrastructure Example

The diagram below provides an example of a Fintech application and its supporting infrastructure implemented in AWS. Financial institutions, banks, and specialized lenders use this solution to manage their Asset Based Lending (ABL) portfolio and deliver services to their clients.

Fintech ABL Application and Infrastructure

In this example infrastructure, each resource marked with the TF logo either has a Terraform provider or has HCL code available to configure it.

Provisioning and deploying for new clients is complex, as each deployment has to meet the client’s individual compliance requirements as well as banking and regulatory rules and standards. And what about a foreign client bank? In that case, the deployment must meet completely different banking and regulatory rules and compliance standards.

And of course, overarching everything are stringent cybersecurity requirements. In many cases these are necessary because the application has to interface securely with external core banking systems for daily updates on the loan status of each customer.

Imagine duplicating this infrastructure each time a new client is onboarded, using a CI/CD solution that is not optimized for IaC.

IaC and CI/CD

As mentioned, a CI/CD solution is strongly recommended for provisioning and deploying IaC-based infrastructures. One of the most widely used CI/CD solutions is Jenkins.

However, Jenkins and other generic CI/CD solutions were designed and optimized for application-centric workflows, and they have limitations that hinder scalability when used for IaC-specific development. Some of these issues include:

  • IaC is a stack-based technology, and generic CI/CD tools do not understand stacks.
  • There is no unified toolset or framework for policy management, so checks have to be performed manually.
  • There is no built-in synchronization, so updates and synchronization have to be done manually.
  • They cannot provide an automated workflow across multiple repositories and stacks.
  • They cannot visualize resources, track their cost, or detect and remediate drift.
  • In most cases, SSO and SAML are plugins, which means just another component to configure, manage, and update.
  • Pull requests can conflict when one developer is already working on a piece of code and another developer changes it as well, possibly resulting in code discrepancies.
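
To show what "understanding stacks" could mean in practice, the following sketch orders a set of stacks so that each one is applied only after the stacks it depends on. The stack names and the dependency map are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+), which is one reasonable way to do this, not a description of how any specific tool works.

```python
from graphlib import TopologicalSorter

# Hypothetical stacks: each key depends on the stacks listed in its set.
STACK_DEPENDENCIES = {
    "network": set(),
    "security": {"network"},
    "database": {"network", "security"},
    "application": {"database"},
}


def apply_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every stack comes after the stacks it depends on."""
    # TopologicalSorter raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(dependencies).static_order())


print(apply_order(STACK_DEPENDENCIES))
```

A generic CI/CD tool sees four unrelated jobs here; an IaC-aware tool can derive the run order, and the blast radius of a change, from the dependency graph.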

Generic CI/CD solutions check all the boxes for implementing a well-defined development process with end-to-end workflows that allow DevOps teams to deliver better applications faster, but they do not do the same for IaC.

Accountability and IaC

What do we mean by accountability? It is analogous to a business having a set of accounting books that represent the financial condition of the company as it stands at the present moment, but also its financial history. 

Without this history, it is impossible to trace back and establish accountability, to understand when financial decisions and changes were made, actions taken, and by whom and for what purpose.

Accountability is also applicable to your infrastructure as it enables you to understand the present state and how you arrived at this point. 

Infrastructure accountability provides checks and balances to ensure that you can account for all infrastructure-related events. From the first line of code to production, you need to know and understand every event that has affected or will affect the infrastructure. This ensures process integrity and, by extension, proof of compliance, both internally and externally. Some of the accountability questions that you have to be able to answer are:

  • Code History:
    • What is the history of the code?
    • Who made changes to the code and infrastructure?
    • Do you know if access to your repo is secure? 
  • Change Implementation:
    • Has a change been fully implemented?
    • Can you validate and prove that it was deployed?
    • Are paper-based policies and regulatory compliance requirements being met? Can you prove it?
  • Change Approval Process:
    • What is the change approval process?
    • Are all changes approved, no matter how small or innocuous they seem?
  • Auditability:
    • Are you prepared for an audit?
    • Do you have the supporting materials necessary to satisfy security and compliance audits?

What types of information do you need to answer the above questions?

  • Provenance: The ability to trace every piece of code back to inception, along with any changes made.
  • Compliance: Provide detailed information needed to ensure compliance with policies and procedures, and the ability to provide complete documentation for an audit.
  • Access: Who has access? What do they have access to? What are they allowed to do, and when? Why are they doing it?
  • Visibility: Understand the state of the infrastructure and resources at any point in time.
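
One way to picture the information listed above is as an append-only trail of events that records who did what, to which target, when, and why. The sketch below is a minimal in-memory version. The `AuditEvent` fields, the `AuditLog` class, and the example identifiers are assumptions chosen for illustration; a real system would persist events durably and protect them from tampering.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    who: str        # identity that performed the action
    what: str       # action taken, e.g. "apply" or "approve"
    target: str     # stack or resource the action touched
    why: str        # justification, e.g. a ticket reference
    when: datetime  # timestamp of the action (UTC)


class AuditLog:
    """Append-only, in-memory log (illustrative only)."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def record(self, who: str, what: str, target: str, why: str,
               when: Optional[datetime] = None) -> AuditEvent:
        """Store one event, stamping it with the current UTC time if none is given."""
        event = AuditEvent(who, what, target, why, when or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, target: str) -> list[AuditEvent]:
        """Return every recorded event for one target, oldest first."""
        return sorted((e for e in self._events if e.target == target), key=lambda e: e.when)


log = AuditLog()
log.record("alice", "approve", "stack/network", "TICKET-2")
log.record("bob", "apply", "stack/network", "TICKET-2")
for event in log.history("stack/network"):
    print(event.who, event.what, event.why)
```

With a trail like this, the provenance and access questions above become queries rather than investigations.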

Fintech Example Security Overlay

The complex infrastructure example above gives an overview of a current ABL Fintech solution. The diagram below shows the security that is provisioned and deployed on top of that infrastructure for the application.

Fintech ABL Security Overlay

In many cases, individual clients require VPCs for their implementation, with multiple layers of protection, as they may have different security requirements or their own vendor solutions in place. This is especially relevant when financial institutions have product offerings beyond ABL that carry strict regulatory or industry compliance requirements based on their geographical location.

The Terraform logo marks the security-specific resources that are supported in Terraform and can be defined and provisioned with it.

Controlling security aspects such as access, source code, changes, and deployments, and validating each of them, is not achievable without embedded policy guardrails. These guardrails are essential to achieving secure infrastructure provisioning and deployments. Without policies in place, the probability of a bank passing an internal or external audit is very low.

How to Manage Infrastructure as Code at Scale

Now that we’ve covered code platforms, CI/CD solutions, and accountability, and the significant issues each puts in the way of IaC scalability, let’s take a look at what can help you manage IaC at scale without relying on workarounds or on a manual or partially automated development process:

Automate

Automation covers the end-to-end IaC workflow from repo to production:

  • Integrate with source code repos such as GitHub to trigger IaC workflows when new or revised code is committed.
  • Create a library of pre-configured infrastructure environments that can be shared across teams and projects.
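
As a sketch of commit-triggered workflows, the function below maps the files changed in a commit to the stacks whose project directory contains them, so only the affected stacks are run. The repository layout in `STACK_ROOTS` and the function name `stacks_to_trigger` are hypothetical; how a real tool receives the list of changed files (for example from a webhook payload) is outside the scope of this example.

```python
from pathlib import PurePosixPath

# Hypothetical mapping of stack name to its project directory in the repo.
STACK_ROOTS = {
    "network": "infra/network",
    "database": "infra/database",
}


def stacks_to_trigger(changed_files: list[str], stack_roots: dict[str, str]) -> list[str]:
    """Return the sorted names of stacks whose directory contains a changed file."""
    triggered = set()
    for path in changed_files:
        for stack, root in stack_roots.items():
            # is_relative_to compares whole path components, so "infra/network-old"
            # does not match the "infra/network" root (requires Python 3.9+).
            if PurePosixPath(path).is_relative_to(root):
                triggered.add(stack)
    return sorted(triggered)


print(stacks_to_trigger(["infra/network/main.tf", "README.md"], STACK_ROOTS))
```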

Use Policy as Code

Policy as Code provides immediately enforceable guardrails:

  • Declare rules around infrastructure, access to code and stacks, workflow, state changes, and triggers.
  • Access controls for the 4Ws (who has access, what they can access and do, when, and why).
  • Multi-layer change approvals, with changes applied only after approval.
  • Audit support, with SSO (SAML 2.0) fully integrated.
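
The idea of an enforceable guardrail can be sketched as a function that inspects a proposed change set and returns the reasons it should be denied. The example assumes the `resource_changes` structure of Terraform’s JSON plan output (`terraform show -json`); the two rules and the sample plan are hypothetical. Policy engines typically express such rules in a dedicated policy language; Python is used here only to keep the illustration self-contained.

```python
def deny_reasons(plan: dict) -> list[str]:
    """Return a human-readable reason for every rule the plan violates."""
    reasons = []
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        after = change.get("after") or {}  # "after" is null when a resource is deleted
        # Rule 1 (hypothetical): databases may not be deleted through the normal workflow.
        if rc["type"] == "aws_db_instance" and "delete" in change["actions"]:
            reasons.append(f"{rc['address']}: database deletion is not allowed")
        # Rule 2 (hypothetical): no ingress rule may be open to the whole internet.
        if (rc["type"] == "aws_security_group_rule"
                and after.get("type") == "ingress"
                and "0.0.0.0/0" in (after.get("cidr_blocks") or [])):
            reasons.append(f"{rc['address']}: ingress open to 0.0.0.0/0")
    return reasons


# Hand-written plan fragment for illustration only.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_db_instance.ledger", "type": "aws_db_instance",
         "change": {"actions": ["delete"], "after": None}},
        {"address": "aws_security_group_rule.ssh", "type": "aws_security_group_rule",
         "change": {"actions": ["create"],
                    "after": {"type": "ingress", "cidr_blocks": ["0.0.0.0/0"]}}},
        {"address": "aws_s3_bucket.docs", "type": "aws_s3_bucket",
         "change": {"actions": ["no-op"], "after": {}}},
    ]
}

for reason in deny_reasons(SAMPLE_PLAN):
    print("DENY:", reason)
```

An empty result means the plan may proceed to approval; any entry blocks the run and doubles as audit evidence of why.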

Collaborate

Adaptable to existing workflows with no force-fitting or changes required, or fully customizable to meet goals:

  • Coordination of all runs with the ability to review, comment, and iterate across teams.
  • Prevent conflicting changes and reconcile the differences automatically.
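
A simple way to prevent conflicting changes is to let only one run at a time hold a given stack. The sketch below shows that idea with an in-memory lock table; the `StackLocks` class and the run identifiers are hypothetical, and a real implementation would need a shared store with atomic operations rather than a Python dictionary.

```python
class StackLocks:
    """Tracks which run currently holds each stack (in-memory, single-process sketch)."""

    def __init__(self) -> None:
        self._holders: dict[str, str] = {}

    def acquire(self, stack: str, run_id: str) -> bool:
        """Take the lock if it is free; return True only if run_id now holds it."""
        holder = self._holders.setdefault(stack, run_id)
        return holder == run_id

    def release(self, stack: str, run_id: str) -> None:
        """Release the lock, but only if run_id is the current holder."""
        if self._holders.get(stack) == run_id:
            del self._holders[stack]


locks = StackLocks()
print(locks.acquire("network", "run-a"))  # developer A's run takes the stack
print(locks.acquire("network", "run-b"))  # developer B's run has to wait
locks.release("network", "run-a")
print(locks.acquire("network", "run-b"))  # now B can proceed
```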

Introduce Resource Management

Understand resources and their state:

  • Plan out changes.
  • Drift detection and remediation.
  • Cost of infrastructure.
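
Drift detection boils down to comparing what the code says a resource should look like with what actually exists. The sketch below does this for two flat attribute maps. The attribute names and values are hypothetical, and fetching the actual values from a cloud provider is deliberately left out.

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {attribute: (desired_value, actual_value)} for every attribute that differs.

    A missing attribute is treated the same as None, which is a simplification.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift


# Hypothetical example: someone resized the instance by hand in the console.
desired = {"instance_type": "t3.small", "tags": {"env": "prod"}}
actual = {"instance_type": "t3.large", "tags": {"env": "prod"}}
print(detect_drift(desired, actual))
```

Remediation then means either re-applying the code to restore the desired value or updating the code to adopt the change.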

In addition to all of the above, an ideal, purpose-built orchestration and management solution for IaC should provide, at a minimum, the following vital capabilities. Starting from the left of the diagram:

  • Creation of a pre-built library of infra components.
  • Support for mono or multiple repos with version control.
  • Ability to support multiple projects simultaneously.
  • Integration and application of policy as code in concert with IaC configurations, ensuring that compliance and security initiatives are met.

Learn how to manage Terraform at scale here: 5 Ways to Manage Terraform at Scale (Examples)

Key Points

You can take positive actions to overcome the scalability issues of IaC code platforms, generic CI/CD solutions, and accountability to prevent your infrastructure from inhibiting your future business initiatives and growth.

You can be realistic and find an IaC orchestration and management solution that enables your development process to scale. 

And finally, you ought to be aware that proactive management and control over compliance and audit objectives are necessary and can be covered through Policy as Code, integrated into your IaC development, security, provisioning, and deployment workflows.

If you are concerned about achieving IaC scalability, try a free trial of Spacelift! You can overcome code platform issues and enable accountability with a CI/CD solution built from the ground up for IaC. Control the process, visualize resources and create workflows that work, using Spacelift as your IaC orchestration and management solution.
