Cloud-based infrastructure is exceptionally good at scaling to support millions of users. It seems effortless, but things start getting complicated when you move beyond a simple autoscaling group and a load balancer. Once you start dealing with Terraform, Kubernetes, Ansible, and other Infrastructure as Code tools, your codebase begins to swell.
A codebase usually swells because numerous engineers are working on different parts of it. As the number of people involved grows, so does the potential for mistakes. Things like syntax errors or the occasional forgotten comment can be mitigated quickly and harmlessly, but mistakes such as a leaked security key, improper storage security setting, or an open security group could prove disastrous.
That’s why it’s important to investigate ways to manage your infrastructure as it really starts to scale.
Scalable infrastructure refers to the infrastructure’s capability to increase or decrease in size based on demand in a flexible and efficient manner. This concept is essential for modern businesses that must quickly adapt to fluctuating workloads and user demands without sacrificing performance or incurring excessive costs.
What are the benefits of scalable infrastructure?
Scalable infrastructure accommodates your needs from four perspectives:
- Availability — Scalable infrastructure ensures the application is up and running at all times.
- Application performance — Scalable infrastructure ensures that your performance is not affected.
- Better quality of service — By improving availability and application performance, you can easily offer better quality to your customers.
- Cost — By scaling down during idle times, you reduce costs associated with your infrastructure.
The non-scalable way of managing IaC remains very popular and continues to be useful for single-developer shops and small stacks that just don’t want to deal with the added complexity and abstraction of the methods we will discuss in this article.
Some non-scalable strategies, such as using git repositories and basic security controls, are absolutely crucial at any scale if you don’t want your code to disappear after one fateful lightning strike, but many others are not entirely necessary when starting out.
Being pragmatic and writing readable and maintainable code is incredibly important when building, but eventually, you’ll need to start preparing for scale.
When infrastructure developers run infrastructure deployments from their terminals, consistency almost invariably takes a major hit.
Developers coding from their terminals is perfectly acceptable. In fact, the idea that developers should be tied to a remote terminal managed by the company is a vision that only the most diehard corporate security evangelists cherish. Developers need the freedom to manage their environment completely, but that freedom should be demarcated at deployment.
These terminals should have access to the repository to which the code is being committed and nowhere further. Handing out access keys to cloud environments is just asking for trouble. It seems easy when you’re deploying a few environments, but when you’re deploying to hundreds of environments or more, things begin to get pretty unwieldy. It’s not just a conversation about key rotation; you also have to consider traversal attacks and privilege escalation.
So, keeping everything managed in a more controlled environment greatly reduces your attack surface.
Here are some ways to manage your scaling infrastructure.
Centralizing control shouldn't deprive developers of the ability to control their environment and maximize their personal productivity. So how do we break down the silos and ensure collaboration is seamless? GitOps is the answer.
The idea is that a Git repository serves as the single source of truth for all code and as the starting point for your deployments. This is the core principle on which tools like Spacelift operate, and GitOps will be a central theme throughout this article.
GitOps leverages a Git workflow with continuous integration and continuous delivery (CI/CD) to automate infrastructure updates. Any time new code is merged, the CI/CD pipeline implements that change in the environment.
By centralizing the deployment mechanisms, it's much easier to ensure everyone can deploy when they need to, with the right controls and policies in place. Requiring all deployed code to be checked into a repository also removes silos and gives visibility into what your developers are working on.
It can be difficult to choose between a typical polyrepo (multiple repositories) strategy or a monorepo (a single repository with all code).
Facebook uses the monorepo strategy, trusting all their engineers to see all of the code. This makes code reuse much easier and enables a simpler set of security policies. If you're an employee of Facebook, you get access. It's that simple. However, given the level of trust required, this strategy can create issues in organizations with many divisions.
For companies such as financial institutions, which operate in strict regulatory environments, it is important to be able to restrict the visibility of each division’s code to the relevant engineers. The same goes for companies with multiple skunkworks-style divisions, where leaks can create disastrous consequences for marketing and legal teams.
For these companies, a polyrepo strategy is best. Managing the security of these to prevent traversal attacks and the like is the top priority for security teams, and permissions should be audited frequently. Once your repository structure is set up, your GitOps strategy can commence. Unfortunately, not everything can be managed within the repository.
State management is a huge issue for any tool that maintains state, and Terraform is the clearest example. Other tools, such as Ansible and Kubernetes, don't keep a state file, but they still produce sensitive artifacts that must be managed.
Unfortunately, state can be problematic for the GitOps paradigm as you definitely don’t want to store state in your repository. Because of this, many companies use a combination of S3 and DynamoDB for their state management. Similar offerings from other cloud providers will suffice as well, but I’ll stick to the most common to keep things simple.
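A minimal sketch of such a backend, assuming an existing S3 bucket and DynamoDB lock table (both names below are placeholders):

terraform {
  backend "s3" {
    bucket         = "my-company-terraform-state"        # hypothetical bucket name
    key            = "networking/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                                 # encrypt state at rest
    dynamodb_table = "terraform-state-locks"              # hypothetical table used for state locking
  }
}

The bucket holds the state itself, while the DynamoDB table provides the locks that prevent two runs from mutating the same state at once.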
For those of you using CloudFormation or ARM/Bicep, your patience navigating cluttered syntax to avoid managing state is commendable.
Those of you using tools such as Terraform Cloud and Spacelift or open-source alternatives like Atlantis won’t find much to use in this section because these tools manage Terraform state for you.
Once you start managing workloads that could be considered scale, managing state is not the first thing you want to have to think about. Securing your state buckets, providing the right amount of access to those that need it, and managing the DynamoDB tables to maintain integrity through locks are all crucial elements to managing the state yourself. It’s not incredibly difficult, but the repercussions of a mistake are dire.
Always ensure you encrypt your state and use any features available within your IaC tool to help keep tabs on sensitive values. In Terraform, you can mark values as sensitive to ensure they don’t leak into the terminal, but these values are still in the state to be consumed by anyone with access. Everything you need is in the documentation, so just make sure you keep things secure and manage locks properly to avoid integrity issues.
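For example, marking values as sensitive in Terraform looks roughly like this; note the caveat that the value still ends up in state:

variable "db_password" {
  type      = string
  sensitive = true
}

output "db_connection_string" {
  # Redacted in plan/apply output, but still stored in plain text in the state file,
  # which is why the state backend itself must be encrypted and access-controlled.
  value     = "postgres://admin:${var.db_password}@db.internal:5432/app"
  sensitive = true
}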
The concept of “shift left” security puts much of the onus of security on developers, but the security team must also impose a set of controls on the other end of the spectrum in the deployed environment. This leaves the bulk of the weight at either end of the security barbell, with the two human components constantly having to work around each other.
Using GitOps to implement security testing through the entire pipeline is critical to a scalable deployment. This seems like a no-brainer, but many organizations place this on the back burner or opt to perform basic code-linting without any actual policies.
Using a policy management platform, such as Open Policy Agent (OPA), is critical to keeping your security footprint in check at every step of the deployment. Leveraging OPA, you can automate compliance checks within your CI/CD pipelines, providing a powerful safeguard against potentially malicious jobs and services running on your systems and increasing the efficiency of the development process.
By doing this, you’ll ensure you don’t have the weight on both sides of the deployment, and you distribute it more evenly throughout the process.
We’ll dive a little deeper into these processes throughout the article.
Ensuring the minimum number of resources are affected if something goes wrong is an intrinsic element of infrastructure scaling. When dealing with thousands of resources, one fateful configuration disaster can take hours, or even days, to recover from. Some of you may remember the S3 outage caused by a mistyped command during a routine playbook: https://aws.amazon.com/message/41926/
If AWS had implemented policies that required confirmation when a certain number of critical resources were modified, this whole incident could have been avoided.
One of the most common ways to configure these policies is to use a policy-as-code tool such as OPA and a clever “scoring” algorithm to score your resources based on their importance. If the resources set to be modified have a score above a threshold, it can require manual intervention from management, SREs, or whoever is on the list.
For example, you might assign scores like:
- EC2 instance in a 100-instance autoscaling pool: 1 point
- Redundant load balancer in a dual-LB setup: 25 points
- Production database: 100 points
You could easily set a threshold using OPA's scripting language, Rego, to require authorization if the total points exceed 49. This would ensure you can't lose two load balancers, more than half of your EC2 instances, or your production database without someone signing off.
Here is another example written in Rego that illustrates the concept:
package spacelift

# This policy attempts to create a metric called a "blast radius" - that is, how much the change
# will affect the whole stack. It assigns special multipliers to some types of resources changed
# and treats different types of changes differently: deletes and updates are more "expensive"
# because they affect live resources, while new resources are generally safer and thus "cheaper".
# We will fail Pull Requests with changes violating this policy, but require human action
# through **warnings** when these changes hit the tracked branch.

proposed := input.spacelift.run.type == "PROPOSED"

deny[msg] {
  proposed
  msg := blast_radius_too_high[_]
}

warn[msg] {
  not proposed
  msg := blast_radius_too_high[_]
}

blast_radius_too_high[sprintf("change blast radius too high (%d/100)", [blast_radius])] {
  blast_radius := sum([blast |
    resource := input.terraform.resource_changes[_]
    blast := blast_radius_for_resource(resource)
  ])
  blast_radius > 100
}

blast_radius_for_resource(resource) = ret {
  blasts_radii_by_action := {"delete": 10, "update": 5, "create": 1, "no-op": 0}
  ret := sum([value |
    action := resource.change.actions[_]
    action_impact := blasts_radii_by_action[action]
    type_impact := blast_radius_for_type(resource.type)
    value := action_impact * type_impact
  ])
}

# Let's give some types of resources special blast multipliers.
blasts_radii_by_type := {"aws_ecs_cluster": 20, "aws_ecs_user": 10, "aws_ecs_role": 5}

# By default, blast radius has a value of 1.
blast_radius_for_type(type) = 1 {
  not blasts_radii_by_type[type]
}

blast_radius_for_type(type) = ret {
  blasts_radii_by_type[type] = ret
}
In the above example, you can see the different “blasts_radii_by_type” multipliers and their definitions:
- “aws_ecs_cluster”: 20
- “aws_ecs_user”: 10
- “aws_ecs_role”: 5
The deny rule states that if “blast_radius_too_high” produces a message, the run is denied, with the threshold currently set at > 100. This obviously gets much more complicated as you start working with a significant amount of infrastructure, but it’s a great starting point.
By limiting infrastructure developers’ ability to deploy arbitrary resources and attributes, you help constrain the blast radius and solidify the security posture you require. By coupling strict linting and code-scanning rules with a module-based policy, you can ensure only the right settings get deployed. This also helps with costing as you can ensure only certain resources within your budgetary constraints are available within the modules.
Here is an example of a module-oriented structure:
- Modules are created by the infrastructure team responsible for that module — networking, security, compute, etc.
- Condition checks are added to ensure proper constraints (a small example follows this list).
- Pre-commit hooks lint and scan for any vulnerabilities using tools such as tfsec or Checkov.
- Modules are committed to VCS, where they are tested or consumed by a module registry.
- If you are using a platform such as Spacelift that has a private module registry with built-in testing, those tests can be run on the modules to ensure they operate correctly.
- Policies are written to ensure deployments don’t break any constraints. Open Policy Agent is great for this. Terraform Cloud also offers Sentinel.
- Engineers are given access to these modules as their roles dictate. All of the guardrails are in place, and they are unable to configure parameters outside of those boundaries without triggering policies in place.
- Code is deployed, and final compliance tests are performed within the environment to ensure everything is still up to standards, using services such as AWS CloudTrail or GCP Cloud Audit Logs, for example.
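As a sketch of the condition checks mentioned in the list above, a compute module could constrain which instance types consumers may request (the allowed sizes here are purely illustrative):

variable "instance_type" {
  type        = string
  description = "Instance type consumers of this module may request"

  validation {
    # Reject anything outside the approved, budget-friendly sizes at plan time.
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type must be one of the approved, budget-friendly sizes."
  }
}

Because the validation runs during plan, an engineer who requests an oversized instance fails fast, before any policy engine or reviewer gets involved.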
Role-Based Access Control is a very important aspect of your security posture. Kubernetes has native RBAC that is extremely useful when coupled with Namespaces. Terraform doesn’t exactly have an RBAC system, but you can use a module-based structure as we discussed previously to minimize the risk and help ensure standards are maintained.
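For Kubernetes, a namespaced role and its binding can even be managed from Terraform itself via the kubernetes provider; here is a rough sketch with illustrative names (namespace, group, and verbs are placeholders):

resource "kubernetes_role" "deployer" {
  metadata {
    name      = "deployer"
    namespace = "team-a" # hypothetical namespace
  }

  rule {
    api_groups = ["apps"]
    resources  = ["deployments"]
    verbs      = ["get", "list", "create", "update", "patch"]
  }
}

resource "kubernetes_role_binding" "deployer" {
  metadata {
    name      = "deployer-binding"
    namespace = "team-a"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.deployer.metadata[0].name
  }

  subject {
    kind      = "Group"
    name      = "team-a-engineers" # hypothetical identity-provider group
    api_group = "rbac.authorization.k8s.io"
  }
}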
Secret management is important whether you’re a two-person startup or a 200,000-person corporation. Protecting your secrets is a foundational concept that doesn’t require much explanation.
Verifying that secrets never enter your repository to begin with is a more crucial element as your codebase scales. With a few thousand lines of code in a monorepo, it can be fairly straightforward to manage your secrets, but once you get into the millions of lines of code, things get rather complicated.
Ensure you’re using code scanning tools, encryption utilities, and anything else at your disposal that can help ensure this doesn’t happen. AWS Secrets Manager, Azure Key Vault, and HashiCorp Vault are all examples of applications that can assist with this. And, of course, if you do need static credentials instead of the recommended short-lived ones, ensure they are rotated frequently.
In Spacelift, secret definitions and files can be managed within a Context, which allows you to define any secrets you have and have them automatically encrypted.
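If you manage Spacelift itself with OpenTofu/Terraform, a context and a shared secret can be sketched roughly like this using the Spacelift provider’s context resources (the stack ID and values below are placeholders):

resource "spacelift_context" "shared_secrets" {
  name        = "shared-secrets"
  description = "Credentials shared across application stacks"
}

resource "spacelift_environment_variable" "api_token" {
  context_id = spacelift_context.shared_secrets.id
  name       = "API_TOKEN"
  value      = var.api_token
  write_only = true # stored encrypted and never echoed back in plans or logs
}

resource "spacelift_context_attachment" "app" {
  context_id = spacelift_context.shared_secrets.id
  stack_id   = "app-production" # hypothetical stack ID
}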
Secrets can also be added directly to the stack’s environment, but the contexts allow these secrets to be shared between stacks. This greatly simplifies the secret management process. Authentication to the cloud providers and VCS providers is also incredibly important to manage properly.
If your cloud is AWS, using IAM roles, AWS STS, and other temporary-credential services is crucial. Other clouds have similar services that should always be preferred over static credentials.
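As an illustration of the role-based approach, a deployment role assumed via STS might be defined like this; the principal ARN and attached policy are placeholders and should be scoped down in practice:

resource "aws_iam_role" "deployer" {
  name                 = "iac-deployer"
  max_session_duration = 3600 # temporary credentials expire after one hour

  # Only the CI/CD principal may assume this role; no static access keys are handed out.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::123456789012:role/ci-runner" } # hypothetical principal
    }]
  })
}

resource "aws_iam_role_policy_attachment" "deployer" {
  role       = aws_iam_role.deployer.name
  policy_arn = "arn:aws:iam::aws:policy/PowerUserAccess" # illustrative only; narrow this in practice
}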
Spacelift simplifies this process with its Cloud Integrations service that automatically assumes temporary credentials for each run.
As far as VCS authentication goes, this should be done with an SSO service or similar whenever possible. Spacelift also automates this process and allows full access to the repositories necessary for deployment without any passing around of credentials.
As you build scalable infrastructure, you are likely to have many CI/CD pipelines running in the background. These pipelines need to get inputs from somewhere, and if you don’t leverage an approach that takes advantage of shareable variables or files, you will probably end up with repetitive code, or at least with duplication of these variables or files.
Shared variables are significant from both an infrastructure and an application standpoint. If you think about infrastructure, the first thing that comes to mind is the authentication to your cloud provider. Given that you are likely to have multiple automations that need to interact in some way with your cloud infrastructure, you will need to set up multiple credentials. Shared variables make this easier because you set up the variables just once, and you can then consume them in all the things you are building. These variables should be secured and factor in everything mentioned under secret management and RBAC before they are implemented.
In other cases, you can use shared variables to simply speed up the process of sharing data among multiple points inside your workflow. Many available tools can help with this type of action, and most of them have some sort of secret-management mechanism embedded in them.
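In the Terraform/OpenTofu world, one common way to share values between stacks and pipelines is the terraform_remote_state data source, which reads another stack’s outputs instead of duplicating them (bucket, key, and output name below are hypothetical):

data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "my-company-terraform-state"        # hypothetical state bucket
    key    = "networking/prod/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0"        # placeholder AMI ID
  instance_type = "t3.micro"
  # Consume the networking stack's output instead of hard-coding the subnet ID.
  subnet_id     = data.terraform_remote_state.network.outputs.private_subnet_id
}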
Sharing files is crucial for various big data applications. Imagine many IoT devices writing data to JSON files, hundreds of GBs daily. The data from these files should be easily accessible to some of your automations, and it should be manipulated to render whatever you need inside your application. This would be almost impossible without the capability to share these files inside a distributed system, because the time required to extract the data in a meaningful way would be much longer than the time required to collect it. Most cloud providers offer solutions for this kind of exercise, and there are also open-source alternatives you can install and manage yourself if you want to build something in-house.
Sharing variables and files between nodes, processes, and tools speeds up deployments, reduces the potential for human error, and makes systems more efficient and robust.
The actual deployment of most resources should be an automated process. With continuous delivery, deployment is automated, removing much of the manual overhead. Manual approvals require considerable human intervention, making the process more complicated and slowing it to a crawl; requiring them should be the exception, not the rule. You can use the strategies we've outlined (blast radius scoring, OPA policies, etc.) to make intelligent decisions on when to involve manual processes.
Another step you may want to take during deployments is to test the deployment in another environment. Deploying to a test or staging environment before deploying to production is an integral part of the deployment process. Managing this pipeline, catching edge cases, and keeping it as hands-off as possible is crucial when dealing with a large number of deployments daily.
Some organizations may deploy to a test environment first by pushing changes to a dev branch, testing fully, and then merging to prod, which deploys again using a different context.
Others may want to push straight to a prod branch, have the pipeline run the plan, then deploy. The difference here is that the code merge actually happens after the deployment has been made. This is sometimes referred to as a “pre-merge apply” and is commonly used within the Atlantis open-source IaC deployment tool. This strategy allows the main branch to remain clean if there are issues encountered. You definitely want to be cautious about your blast radius here. Even if the plans are usually pretty reliable, things do happen.
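Whichever branching model you choose, the “same code, different context” part is often expressed with workspaces or per-environment variable files; here is a minimal sketch keyed off the workspace name (the settings themselves are illustrative):

locals {
  environment = terraform.workspace # e.g. "staging" or "production"

  settings = {
    staging = {
      instance_type = "t3.micro"
      min_size      = 1
    }
    production = {
      instance_type = "t3.large"
      min_size      = 3
    }
  }

  # Downstream resources read local.env.instance_type, local.env.min_size, etc.
  env = local.settings[local.environment]
}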
Check out 7 steps to optimize your CI/CD pipelines for scaling.
Building scalable infrastructure usually means leveraging infrastructure as code. Modular and reusable configurations typically rely on loops and on services that can autoscale.
For example, you could build many EC2 instances using this OpenTofu configuration:
data "aws_ami" "ubuntu" {
most_recent = true
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
filter {
name = "architecture"
values = ["x86_64"]
}
owners = ["099720109477"] #canonical
}
locals {
instances = {
instance1 = {
ami = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
}
instance2 = {
ami = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
}
instance3 = {
ami = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
}
}
}
resource "aws_key_pair" "ssh_key" {
key_name = "ec2"
public_key = file(var.public_key)
}
resource "aws_instance" "this" {
for_each = local.instances
ami = each.value.ami
instance_type = each.value.instance_type
key_name = aws_key_pair.ssh_key.key_name
associate_public_ip_address = true
tags = {
Name = each.key
}
}
By changing the local variable instances, you can easily scale the number of instances up or down. This is a scalable infrastructure configuration that scales manually based on the input.
At the same time, if you use an autoscaling group, you can easily scale up or down based on load:
resource "aws_launch_template" "example" {
name_prefix = "example-"
image_id = var.ami
instance_type = var.instance_type
}
resource "aws_autoscaling_group" "example" {
desired_capacity = 5
max_size = 10
min_size = 3
vpc_zone_identifier = var.subnets
launch_template {
id = aws_launch_template.example.id
version = "$Latest"
}
}
This will have a minimum of three instances, a maximum of ten instances, and a desired capacity of five.
There are many things that you should consider when building a scalable infrastructure:
- Use a modular and reusable architecture – Build modules if you are using OpenTofu or Terraform, take advantage of microservices, and split your workflows into smaller pieces.
- Automation – Leverage IaC, CI/CD, configuration management, and container orchestration.
- Take advantage of scaling mechanisms – Take advantage of autoscaling (horizontal and vertical scalers) and use services that are built for scaling, such as AWS ASG or AWS EKS.
- Use load balancers – Distribute traffic to services using load balancers to ensure the load is spread based on your defined criteria.
- Implement monitoring and logging – Track the performance and health of your infrastructure.
- Implement security and governance – Use policies to restrict resources and resource parameters, and take advantage of network security groups, firewalls, network ACLs, encryption, and RBAC.
- Cost management – Take the necessary measures to avoid cost-related issues, such as right-sizing compute and taking advantage of discounts (reserved instances).
Spacelift offers all the necessary mechanisms to manage infrastructure at scale while implementing all the best practices.
With Spacelift, you can easily ramp up your workflows with:
- Dependencies – Build simple workflows that combine into a complex one by creating dependencies between them and sharing outputs. This ensures that issues are easy to find and the risk associated with changes is minimized.
- Policies – Take advantage of safer deployments by restricting the resources that engineers can create and the parameters they can have, adding a minimum number of approvals for a run, building custom policies for your custom integrations, controlling where to send notifications and metrics, and controlling what happens when a pull request is opened or merged.
- Contexts – Build reusable containers for your environment variables, mounted files, and lifecycle hooks, and easily attach them to as many stacks as you want, promoting reusability and ensuring everything happens the way you expect.
- Cloud integrations – Easily leverage dynamic credentials for major cloud providers (AWS, Azure, GCP).
- Blueprints – Take advantage of self-service infrastructure to enhance developer velocity.
- Drift detection and optional remediation – Easily detect drift and optionally fix it automatically.
- Resource view – View all the resources that have been deployed with your Spacelift account and get information about their health.
- Private workers – Take advantage of your own workers to implement the regulations associated with your organization.
To learn more about Spacelift, create a free account today or book a demo with one of our engineers.
Looking at the issues we’ve discussed here, it’s easy to see just how unwieldy and error-prone infrastructure deployments can become. Entire organizations suffer as technical debt mounts, and developers use workarounds and emergency patches to get their jobs done and avoid opening a ticket for every little thing they need to deploy. Things can quickly start to collapse.
By following the guidance in this article and taking the time to really map out your processes while engaging all teams and stakeholders involved, you will be able to scale your infrastructure deployments as far as your ambition requires.
Automation and Collaboration Layer for Infrastructure as Code
Spacelift is a flexible orchestration solution for IaC development. It delivers enhanced collaboration, automation, and controls to simplify and accelerate the provisioning of cloud-based infrastructure.