10 Strategies to Build and Manage Scalable Infrastructure

Strategies to Manage Infrastructure at Scale

Cloud-based infrastructure can scale to remarkable heights to support millions of users. It seems effortless, but things get complicated as soon as you go beyond a simple autoscaling group and a load balancer. Once you start dealing with Terraform, Kubernetes, Ansible, and other Infrastructure as Code tools, your codebase begins to swell.

A swelling codebase is usually due to a large number of engineers working on several different pieces of the puzzle. Dealing with multiple humans means there will inevitably be mistakes. Some of these mistakes, such as syntax errors or the occasional forgotten comment, are quickly mitigated and cause no harm. Other mistakes, such as a leaked security key, improper storage security setting, or an open security group, can prove to be disastrous.

Let’s take a look at some ways to manage your infrastructure at scale when things really start to take off.

1. IaC Tools and the Non-scalable Way

I wanted to call this the “old” way, but to be honest, the non-scalable way that I’m going to discuss is still heavily in use. In fact, the non-scalable way is quite useful for single-developer shops and small stacks that just don’t want to deal with the added complexity and abstraction of the methods mentioned in this article.

Some of the strategies, such as using git repositories and basic security controls, are absolutely crucial at any scale if you wish for your code not to disappear into the ether after one fateful lightning strike, but many of the others just aren’t entirely necessary when starting out.

Being pragmatic and writing readable and maintainable code is incredibly important when building, but eventually, you’ll need to start preparing for “scale”. Let’s take a look at some of the issues with the non-scalable way of doing things.

2. Terminals and Silos

Obviously, we’re not talking about train stations and farming infrastructure here. When infrastructure developers are running infrastructure deployments from their terminals, consistency will most likely take a major hit.

Now, don’t get me wrong, developers coding from their terminals is perfectly fine. The idea that developers should be tied to a remote terminal managed by the company is a pipe dream that only the most die-hard corporate security evangelists hold. Developers need the freedom to manage their environment completely, but that freedom should end at deployment.

These terminals should have access to the repository to which the code is being committed and nowhere further. Handing out access keys to cloud environments is just asking for trouble. It seems easy when you’re deploying a few environments, but when you’re deploying to hundreds of environments or more, things begin to get pretty unwieldy. It’s not just a conversation about key rotation; you also have to consider traversal attacks and privilege escalation.

So, keeping everything managed in a more controlled environment greatly reduces your attack surface.

3. Enter GitOps!

We’ve established your developers are not going to leave THEIR terminals. It would be outright silly to expect developers not to be able to control their environment to maximize their own productivity. So how do we break down the silos and ensure collaboration is seamless? GitOps, of course!

The idea is that a Git repository is the single source of truth for all code and the starting point of your deployments. This is the core principle on which tools like Spacelift operate, and GitOps is going to be a central theme for this article.

By centralizing the deployment mechanisms, it’s much easier to ensure everyone is able to deploy when they need to, with the right controls and policies in place. Requiring all deployed code to be checked into a repository also removes the silos and gives visibility into what your developers are working on.

4. Monorepo vs. Polyrepo

Choosing between using a typical polyrepo (multiple repositories) strategy or a monorepo (a single repository with all code) can be a difficult task.

Facebook, for example, chose the monorepo strategy. They trust all of their engineers to see all of the code. This makes code reuse much easier and allows a simpler set of security policies. If you’re an employee of Facebook, you get access. It’s as simple as that. This, of course, requires a lot of trust, and when you have a huge number of divisions, it can definitely lead to issues.

Since Facebook doesn’t necessarily face the same regulatory constraints as, say, a major financial institution, this is a more efficient route for them in which they thrive. For companies that do face such regulatory challenges, being able to ensure engineers in one division don’t see the code of another can be incredibly important. This also goes for companies with multiple “skunk works” style divisions where any leak can cause disastrous consequences for marketing…or the legal team.

For these companies, a polyrepo strategy is best. Managing the security of these to prevent traversal attacks and the like is the top priority for security teams, and permissions should be audited frequently. Once your repository structure is set up, your GitOps strategy can commence. Unfortunately, not everything can be managed within the repository.

5. State Management

State management is a huge issue for any tool that tracks state, and Terraform is the clearest example. Other IaC tools, such as Ansible and Kubernetes, don’t encounter the same state-related issues, but there are still sensitive artifacts that must be managed.

Unfortunately, state can be a huge hassle for the GitOps paradigm as you definitely don’t want to store state in your repository. Because of this, many companies use a combination of S3 and DynamoDB for their state management. Similar offerings from other cloud providers will suffice as well, but I’ll stick to the most common to keep things simple.

For those of you using CloudFormation or ARM/Bicep, congratulations, your ability to battle through cluttered syntax to avoid managing state is commendable.

For those of you using tools such as Terraform Cloud and Spacelift, which manage Terraform state for you, you also may not find much use out of this section as your lives are much simpler.

Once you start managing workloads at scale, managing state is not the first thing you want to have to think about. Securing your state buckets, providing the right amount of access to those who need it, and managing the DynamoDB tables that maintain integrity through locks are all crucial elements of managing the state yourself. It’s not incredibly difficult, but the repercussions of a mistake are dire.

Always ensure you encrypt your state and use any features available within your IaC tool to help keep tabs on sensitive values. In Terraform, you can mark values as sensitive to ensure they don’t leak into the terminal, but these values are still in the state to be consumed by anyone with access. There’s not much else to say that the documentation doesn’t already, so just make sure you keep things secure and manage locks properly to avoid integrity issues.
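As a minimal sketch of what “securing your state buckets” can look like in policy form, the rule below rejects a plan that loosens an S3 public access block. It is written in the same pre-1.0 Rego style as the policy later in this article and assumes the same input shape (input.terraform.resource_changes, with an address field on each change). It only helps if the state bucket is itself managed through a stack that this policy is attached to, which is an assumption.

```rego
package spacelift

# The four settings on aws_s3_bucket_public_access_block that should all be true.
required_flags := {
	"block_public_acls",
	"block_public_policy",
	"ignore_public_acls",
	"restrict_public_buckets",
}

# Deny any planned public access block that leaves one of the flags disabled.
deny[msg] {
	resource := input.terraform.resource_changes[_]
	resource.type == "aws_s3_bucket_public_access_block"
	required_flags[flag]
	resource.change.after[flag] != true
	msg := sprintf("%s must set %s = true", [resource.address, flag])
}
```

A check like this does not replace encryption or locking; it only guards one of the settings that tends to get changed by accident.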

6. “Barbell” Security

A “thought-leader” blog post wouldn’t be very “thought-leadery” without some quirky new term to explain a concept. This post is no different! In this case, I’m going to discuss what I call “barbell security”.

There has been a lot of conversation around “shift left” security, which puts more of the onus of security on developers, but there still needs to be a set of controls on the other end of the spectrum, in the deployed environment, and those controls are deployed by a security team. This leaves the two human components carrying the bulk of the weight at either end of the security barbell, constantly having to work around each other.

Using GitOps to implement security testing through the entire pipeline is critical to a scalable deployment. For anyone dealing with a deployment that’s even close to being considered “scale”, this seems like a no-brainer, but a lot of organizations have placed this on the back burner or have opted to do basic code-linting without any actual policies.

Using a policy management platform, such as OPA, is critical to keeping your security footprint in check at every step of the deployment. By doing this, you’ll ensure you don’t have the weight on both sides of the deployment, and you distribute it more evenly throughout the process.
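To make “actual policies” concrete, here is a small sketch of a mid-pipeline check for the open security group mentioned at the start of this article. It uses the same pre-1.0 Rego style and the same input shape (input.terraform.resource_changes) as the blast radius policy further down. The address field and the choice to cover only IPv4 cidr_blocks are assumptions made for brevity.

```rego
package spacelift

# CIDR ranges treated as "open to the world" (IPv4 only in this sketch).
open_cidrs := {"0.0.0.0/0"}

# Standalone rules declared with aws_security_group_rule.
deny[msg] {
	resource := input.terraform.resource_changes[_]
	resource.type == "aws_security_group_rule"
	resource.change.after.type == "ingress"
	cidr := resource.change.after.cidr_blocks[_]
	open_cidrs[cidr]
	msg := sprintf("%s allows ingress from %s", [resource.address, cidr])
}

# Inline ingress blocks declared inside aws_security_group.
deny[msg] {
	resource := input.terraform.resource_changes[_]
	resource.type == "aws_security_group"
	cidr := resource.change.after.ingress[_].cidr_blocks[_]
	open_cidrs[cidr]
	msg := sprintf("%s allows ingress from %s", [resource.address, cidr])
}
```

Because the check runs against the plan, it catches the problem before anything is deployed, which is the weight being moved off the two ends of the barbell.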

We’ll dive a little deeper into these processes throughout the article.

7. Policies Reduce Blast Radius

“Close” only matters in Horseshoes, right? Or so the old adage goes. If you’re not familiar with the game, Horseshoes allows you to acquire points just by touching the target versus actually “ringing” it. The game of Horseshoes allows you to score points based on a larger “blast radius”.

Well, in security, if you get your blast radius wrong, you won’t get any points. In fact, the only thing you may score is a termination. 

 

Ensuring that, if something does go wrong, a minimum number of resources is affected is an incredibly important task when scaling. When dealing with thousands of resources, one fateful configuration disaster can take hours, or even days, to recover from. Some of you may remember this fateful S3 outage caused by a misconfigured “Playbook”: https://aws.amazon.com/message/41926/

Had AWS had policies that required confirmation when a certain number of critical resources were modified, this whole incident could have been avoided. I’m certainly simplifying the incident somewhat, but you get the gist.

There are many ways to configure these policies, but one of the most common is to use a policy-as-code tool such as Open Policy Agent (OPA) and a clever “scoring” algorithm to score your resources based on their importance. If the resources set to be modified have a score above a threshold, it can require manual intervention from management, SREs, or whoever is on the list.

For example,

EC2 instance in a 100 EC2 instance autoscaling pool: 1 point

Redundant Load Balancer in a dual LB setup: 25 points

Production Database: 100 points

You could easily set a threshold using OPA’s policy language, Rego, to require authorization if the total points > 49. This would ensure you can’t lose both load balancers, half or more of your EC2 instances, or your production database without someone signing off. This example is a little contrived, but I’m sure you get the point.
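The point scheme above could be sketched roughly as follows. The point values and the threshold come from the list above; the mapping onto the Terraform types aws_instance, aws_lb, and aws_db_instance is an assumption made for illustration, as is counting only resources the plan would delete. The warn rule is used because, as the comments in the article's policy explain, warnings are how human action is required on tracked runs.

```rego
package spacelift

# Assumed mapping of the example categories onto Terraform resource types.
points_by_type := {"aws_instance": 1, "aws_lb": 25, "aws_db_instance": 100}

threshold := 49

# Sum the points of every resource the plan would delete.
# A replacement also lists "delete" among its actions, so it is counted too.
# Types missing from points_by_type contribute nothing.
total_points := sum([points |
	resource := input.terraform.resource_changes[_]
	resource.change.actions[_] == "delete"
	points := points_by_type[resource.type]
])

warn[msg] {
	total_points > threshold
	msg := sprintf("plan removes %d points of infrastructure (threshold %d)", [total_points, threshold])
}
```

With these numbers, deleting one load balancer and ten instances scores 35 and passes, while deleting both load balancers scores 50 and produces a warning.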

Here is another example written in Rego that illustrates the concept:

package spacelift

# This policy attempts to create a metric called a "blast radius" - that is how much the change will affect the whole stack.
# It assigns special multipliers to some types of resources changed and treats different types of changes differently.
# deletes and updates are more "expensive" because they affect live resources, while new resources are generally safer
# and thus "cheaper". We will fail Pull Requests with changes violating this policy, but require human action
# through **warnings** when these changes hit the tracked branch.

proposed := input.spacelift.run.type == "PROPOSED"

deny[msg] {
	proposed
	msg := blast_radius_too_high[_]
}

warn[msg] {
	not proposed
	msg := blast_radius_too_high[_]
}

blast_radius_too_high[sprintf("change blast radius too high (%d/100)", [blast_radius])] {
	blast_radius := sum([blast |
		resource := input.terraform.resource_changes[_]
		blast := blast_radius_for_resource(resource)
	])

	blast_radius > 100
}

blast_radius_for_resource(resource) = ret {
	blasts_radii_by_action := {"delete": 10, "update": 5, "create": 1, "no-op": 0}

	ret := sum([value |
		action := resource.change.actions[_]
		action_impact := blasts_radii_by_action[action]
		type_impact := blast_radius_for_type(resource.type)
		value := action_impact * type_impact
	])
}

# Let's give some types of resources special blast multipliers.
blasts_radii_by_type := {"aws_ecs_cluster": 20, "aws_ecs_user": 10, "aws_ecs_role": 5}

# By default, blast radius has a value of 1.
blast_radius_for_type(type) = 1 {
	not blasts_radii_by_type[type]
}

blast_radius_for_type(type) = ret {
	blasts_radii_by_type[type] = ret
}

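In the above example, you can see the multipliers defined in “blasts_radii_by_type”:

“aws_ecs_cluster”: 20

“aws_ecs_user”: 10

“aws_ecs_role”: 5

The “blast_radius_too_high” rule fires when the total exceeds the threshold, which is currently set at > 100. The deny rule then fails proposed runs (pull requests), while the warn rule flags the same condition on tracked runs so that a human has to act. For example, deleting a single “aws_ecs_cluster” scores 10 × 20 = 200, which is well over the threshold. This obviously gets much more complicated as you start working with a significant amount of infrastructure, but it’s a great starting point.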

8. Modules and RBAC

Role-Based Access Control is a very important aspect of your security posture. Kubernetes has native RBAC that is extremely useful when coupled with Namespaces. Terraform doesn’t exactly have an RBAC system, but modules are useful to help ensure standards are maintained.

By limiting infrastructure developers’ ability to deploy arbitrary resources and attributes, you help constrain the blast radius and solidify the security posture you require. By coupling strict linting and code-scanning rules with a module-based policy, you can ensure only the right settings get deployed. This also helps with cost control, as you can ensure only resources that fit your budgetary constraints are available within the modules.
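The budget guardrail described above might look something like this sketch, which only lets through EC2 instance sizes from an approved list. The list itself is hypothetical, and the input shape (input.terraform.resource_changes with an address field) is assumed to match the blast radius policy earlier in this article.

```rego
package spacelift

# Hypothetical allowlist; substitute the sizes your budget permits.
allowed_instance_types := {"t3.micro", "t3.small", "t3.medium"}

deny[msg] {
	resource := input.terraform.resource_changes[_]
	resource.type == "aws_instance"
	instance_type := resource.change.after.instance_type
	not allowed_instance_types[instance_type]
	msg := sprintf("%s uses instance type %s, which is not on the approved list", [resource.address, instance_type])
}
```

The same constraint can also live inside the module as a variable validation; keeping a copy in policy catches resources that bypass the module.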

An example of a module-oriented structure may be something like this:

  1. Modules are created by the infrastructure team responsible for that module. Networking, security, compute, etc. 
  2. Condition Checks are added to ensure proper constraints. 
  3. Pre-commit hooks lint and scan for any vulnerabilities using tools such as tfsec or checkov. 
  4. Modules are committed to VCS, where they are tested or consumed by a module registry.
  5. If you are using a platform such as Spacelift that has a private module registry with built-in testing, those tests can be run on the modules to ensure they operate correctly. 
  6. Policies are written to ensure deployments do not break any constraints. Open Policy Agent is great for this. Terraform Cloud also offers Sentinel.
  7. Engineers are given access to these modules as their roles dictate. All of the guardrails are in place, and they are unable to configure parameters outside of those boundaries without triggering policies in place. 
  8. Code is deployed, and final compliance tests are performed within the environment to ensure everything is still up to standards. This could be using services such as AWS CloudTrail or GCP Cloud Audit Logs, for example.
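One way to enforce the policy step of the workflow above is to reject resources that are created outside of any module. The sketch below relies on an assumption about the input: as in Terraform's JSON plan output, a resource declared inside a module carries a module_address field, while a root-level resource does not. Whether your platform passes that field through unchanged is worth verifying before relying on this.

```rego
package spacelift

# Deny new resources that are declared directly in the root module
# instead of being created through a shared module.
deny[msg] {
	resource := input.terraform.resource_changes[_]
	resource.change.actions[_] == "create"
	not resource.module_address
	msg := sprintf("%s is declared outside a module; use an approved module instead", [resource.address])
}
```

This only checks that a module was used, not which one. Restricting the allowed module sources would need a further rule on top of it.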

9. Contexts and Secret Management

At any scale, secret management is important. This section really doesn’t change much whether you’re a 2-person startup or a 200,000-person corporation. Protecting your secrets is a foundational concept that doesn’t require a lot of explanation.

Verifying that secrets never enter your repository to begin with is a more crucial element as your codebase scales. With a few thousand lines of code in a monorepo, it can be fairly straightforward to manage your secrets, but once you get into the millions of lines of code, things get rather complicated.

Ensure you’re using code scanning tools, encryption utilities, and anything else at your disposal that can help ensure this doesn’t happen. AWS Secrets Manager, Azure Key Vault, and HashiCorp Vault are all examples of applications that can assist with this. And, of course, if you do have to use static credentials instead of the recommended short-lived ones, ensure they are rotated frequently.
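Policy can also nudge people away from static credentials in the first place. As a rough sketch, the rule below raises a warning whenever a plan would create an aws_iam_access_key, the Terraform resource for a long-lived IAM access key. It assumes the same input shape as the blast radius policy earlier in this article, and it uses warn rather than deny because the text above accepts that static credentials are sometimes unavoidable.

```rego
package spacelift

# Flag the creation of long-lived IAM access keys for review.
warn[msg] {
	resource := input.terraform.resource_changes[_]
	resource.type == "aws_iam_access_key"
	resource.change.actions[_] == "create"
	msg := sprintf("%s creates a long-lived access key; prefer short-lived credentials and plan for rotation", [resource.address])
}
```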

In Spacelift, secret definitions and files can be managed within a Context, which allows you to define any secrets you have and have them automatically encrypted.

Secrets can also be added directly to the stack’s environment, but the contexts allow these secrets to be shared between stacks. This greatly simplifies the secret management process. Authentication to the cloud providers and VCS providers is also incredibly important to manage properly.

If your cloud is AWS, using AWS Roles, AWS STS, and other temporary credentials services is crucial. Other clouds have similar services that must also be used at all costs.

Spacelift simplifies this process with its Cloud Integrations service that automatically assumes temporary credentials for each run.

As far as VCS authentication goes, this should be done with an SSO service or similar whenever possible. Spacelift also automates this process and allows full access to the repositories necessary for deployment without any passing around of credentials.

10. Deploying Resources

The actual deployment of most resources should be an automated process. Having manual procedures for every deployment would absolutely slow the entire process down to a crawl. Requiring manual approvals should be the exception, not the rule. Using strategies outlined above, such as blast radius, OPA usage, etc., should allow you to make intelligent decisions on when to involve manual processes. 

Another step you may want to take during deployments is to test the deployment in another environment. Deploying to a test or staging environment before deploying to production is an integral part of the deployment process. Managing this pipeline, catching edge cases, and keeping it as hands-off as possible is crucial when dealing with a large number of deployments daily. 

Some organizations may deploy to a test environment first by pushing changes to a dev branch, testing fully, and then merging into prod, which deploys again using a different context.

Others may want to push straight to a prod branch, have the pipeline run the plan, then deploy. The difference here is that the code merge actually happens after the deployment has been made. This is sometimes referred to as a “pre-merge apply” and is commonly used within the Atlantis open-source IaC deployment tool. This strategy allows the main branch to remain clean if there are issues encountered. You definitely want to be cautious about your blast radius here. Even if the plans are usually pretty reliable, things do happen.

Key Points

Thanks to the issues mentioned above, among others, the world of infrastructure deployments can be akin to the famous shootouts of the American Wild West. Unfortunately, unlike many of the stories you hear, Wyatt Earp doesn’t go riding off into the sunset.

On the contrary, the entire organization suffers, and technical debt begins to pile up. Band-aids are applied to give developers the access they need to deploy emergency patches, and developers implement security “workarounds” so they can get their jobs done without opening a ticket for every little thing they need to deploy. Eventually, everything starts to collapse.

By following the tips in this article and taking the time to really map out your processes while engaging all teams and stakeholders involved, you will be able to scale your infrastructure deployments as far as you need.
