10 Strategies to Build and Manage Scalable Infrastructure


Contributed by Spacelift | Originally published on spacelift.io

💡 The first in a series of articles on managing infrastructure

Cloud-based infrastructure is exceptionally good at scaling to support millions of users. It seems effortless, but things start getting complicated when you move beyond a simple autoscaling group and a load balancer. Once you start dealing with Terraform, Kubernetes, Ansible, and other Infrastructure as Code (IaC) tools, your codebase begins to swell.

A codebase usually swells because numerous engineers are working on different parts of it. As the number of people involved grows, so does the potential for mistakes. Problems like syntax errors or the occasional forgotten comment can be fixed quickly and harmlessly, but mistakes such as a leaked security key, an improper storage security setting, or an open security group could prove disastrous.

That’s why it’s important to investigate ways to manage your infrastructure as it really starts to scale.

IaC tools and the non-scalable way

The non-scalable way of managing IaC remains very popular and continues to be useful for single-developer shops and small stacks that just don’t want to deal with the added complexity and abstraction of the methods we will discuss in this article.

Some non-scalable strategies, such as using git repositories and basic security controls, are absolutely crucial at any scale if you don’t want your code to disappear after one fateful lightning strike, but many others are not entirely necessary when starting out.

Being pragmatic and writing readable and maintainable code is incredibly important when building, but eventually, you’ll need to start preparing for scale. Let’s take a look at some of the issues with the non-scalable way of doing things.

Terminals and silos

When infrastructure developers run infrastructure deployments from their terminals, consistency almost invariably takes a major hit.

Developers coding from their terminals is perfectly acceptable. In fact, the idea that developers should be tied to a remote terminal managed by the company is a vision that only the most diehard corporate security evangelists cherish. Developers need the freedom to manage their environment completely, but that freedom should end at deployment.

These terminals should have access to the repository to which the code is being committed and nothing further. Handing out access keys to cloud environments is just asking for trouble. It seems easy when you’re deploying a few environments, but when you’re deploying to hundreds of environments or more, things begin to get pretty unwieldy. It’s not just a conversation about key rotation; you also have to consider traversal attacks and privilege escalation.

So, keeping everything managed in a more controlled environment greatly reduces your attack surface. Here are some ways to manage your infrastructure as it starts to scale:

1. Enter GitOps

We’ve established your developers are not going to leave their terminals. It is unrealistic to deprive developers of the ability to control their environment to maximize their personal productivity. So how do we break down the silos and ensure collaboration is seamless? GitOps is the answer.

The core idea is that a Git repository is the single source of truth for all code and the starting point for your deployments. GitOps leverages a Git workflow with continuous integration and continuous delivery (CI/CD) to automate infrastructure updates: any time new code is merged, the CI/CD pipeline applies that change to the environment.

By centralizing the deployment mechanisms, it’s much easier to ensure everyone can deploy when they need to, with the right controls and policies in place. Requiring all deployed code to be checked into a repository also removes the silos and gives you visibility into what your developers are working on.

2. Monorepo vs. polyrepo

Choosing between a typical polyrepo (multiple repositories) strategy or a monorepo (a single repository with all code) can be difficult.

Facebook uses the monorepo strategy, trusting all of its engineers to see all of the code. This makes code reuse much easier and allows for a simpler set of security policies: if you’re an employee of Facebook, you get access. It’s that simple. However, given the level of trust required, this strategy can cause issues when there are many divisions.

For companies such as financial institutions, which operate in strict regulatory environments, it is important to be able to restrict the visibility of each division’s code to the relevant engineers. The same goes for companies with multiple skunkworks-style divisions, where leaks can create disastrous consequences for marketing and legal teams.

For these companies, a polyrepo strategy is best. Managing the security of these repositories to prevent traversal attacks and the like is a top priority for security teams, and permissions should be audited frequently. Once your repository structure is set up, your GitOps strategy can commence. Unfortunately, not everything can be managed within the repository.

3. State management

State management is a huge issue for any tool that maintains state, and Terraform is the clearest example. Other IaC tools, such as Ansible and Kubernetes, don’t run into the same state-related issues, but they still produce sensitive artifacts that must be managed.

Unfortunately, state can be problematic for the GitOps paradigm because you definitely don’t want to store state in your repository. Because of this, many companies use a combination of S3 and DynamoDB for their state management. Similar offerings from other cloud providers will also work, but we’ll focus on the most common for simplicity.
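
For illustration, here is a minimal sketch of what that backend configuration might look like, assuming a pre-existing bucket and lock table with made-up names:

terraform {
  backend "s3" {
    # Hypothetical names; the bucket and table must already exist.
    bucket         = "example-terraform-state"
    key            = "networking/prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                      # encrypt state objects at rest
    dynamodb_table = "example-terraform-locks" # state locking and consistency checks
  }
}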

If you use CloudFormation or ARM/Bicep, your patience navigating cluttered syntax to avoid managing state is commendable.

Anyone using tools such as Terraform Cloud, Spacelift, or open source alternatives like Atlantis won’t find much to use in this section because these tools manage Terraform state for you.

Once you start managing workloads at any real scale, managing state is not the first thing you want to have to think about. Securing your state buckets, granting the right amount of access to those who need it, and managing the DynamoDB tables that maintain integrity through locks are all crucial elements of managing state yourself. It’s not incredibly difficult, but the repercussions of a mistake are dire.

Always ensure you encrypt your state and use any features available within your IaC tool to help keep tabs on sensitive values. In Terraform, you can mark values as sensitive to ensure they don’t leak into the terminal, but these values are still in the state to be consumed by anyone with access. Everything you need is in the documentation, so just make sure you keep things secure and manage locks properly to avoid integrity issues.
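
As a rough illustration (the variable and output names here are made up), marking values as sensitive in Terraform looks like this:

variable "db_password" {
  description = "Password for the application database"
  type        = string
  sensitive   = true # redacted from plan and apply output
}

output "db_connection_string" {
  value     = "postgres://app:${var.db_password}@db.internal:5432/app"
  sensitive = true # also redacted, but still stored in plain text in the state file
}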

4. Barbell security

The concept of “shift left” security puts much of the onus of security on developers, but the security team must also impose a set of controls at the other end of the spectrum, in the deployed environment. This places the bulk of the responsibility on humans at either end of the security barbell.

Using GitOps to implement security testing through the entire pipeline is critical to a scalable deployment. This seems like a no-brainer, but many organizations place this on the back burner or opt to perform basic code-linting without any actual policies.

Using a policy management platform, such as Open Policy Agent (OPA), is critical to keeping your security footprint in check at every step of the deployment. Leveraging OPA, you can automate compliance checks within your CI/CD pipelines, providing a powerful safeguard against potentially malicious jobs and services running on your systems and increasing the efficiency of the development process.

By doing this, you avoid stacking all of the weight at both ends of the deployment and instead distribute it more evenly throughout the process.

We’ll dive a little deeper into these processes throughout the article.

5. Policies reduce blast radius

Ensuring the minimum number of resources are affected if something goes wrong is an intrinsic element of scaling. When dealing with thousands of resources, one fateful configuration disaster can take hours, or even days, to recover from. Some of you may remember the S3 outage caused by a misconfigured “Playbook”: https://aws.amazon.com/message/41926/

If AWS had implemented policies that required confirmation when a certain number of critical resources were modified, this whole incident could have been avoided.

One of the most common ways to configure these policies is to use a policy-as-code tool such as OPA and a clever “scoring” algorithm to score your resources based on their importance. If the resources set to be modified have a score above a threshold, it can require manual intervention from management, SREs, or whoever is on the list.

For example,

EC2 instance in a 100 EC2 instance autoscaling pool: 1 point

Redundant Load Balancer in a dual LB setup: 25 points

Production Database: 100 points

You could easily set a threshold using OPA’s policy language, Rego, to require authorization if the total points > 49. This would ensure you can’t lose both load balancers, more than half of your EC2 instances, or your production database without manual sign-off.

Here is another example written in Rego that illustrates the concept:

package spacelift

# This policy attempts to create a metric called a "blast radius" - that is how much the change will affect the whole stack.
# It assigns special multipliers to some types of resources changed and treats different types of changes differently.
# deletes and updates are more "expensive" because they affect live resources, while new resources are generally safer
# and thus "cheaper". We will fail Pull Requests with changes violating this policy, but require human action
# through **warnings** when these changes hit the tracked branch.

proposed := input.spacelift.run.type == "PROPOSED"

deny[msg] {
	proposed
	msg := blast_radius_too_high[_]
}

warn[msg] {
	not proposed
	msg := blast_radius_too_high[_]
}

blast_radius_too_high[sprintf("change blast radius too high (%d/100)", [blast_radius])] {
	blast_radius := sum([blast |
		resource := input.terraform.resource_changes[_]
		blast := blast_radius_for_resource(resource)
	])

	blast_radius > 100
}

blast_radius_for_resource(resource) = ret {
	blasts_radii_by_action := {"delete": 10, "update": 5, "create": 1, "no-op": 0}

	ret := sum([value |
		action := resource.change.actions[_]
		action_impact := blasts_radii_by_action[action]
		type_impact := blast_radius_for_type(resource.type)
		value := action_impact * type_impact
	])
}

# Let's give some types of resources special blast multipliers.
blasts_radii_by_type := {"aws_ecs_cluster": 20, "aws_ecs_user": 10, "aws_ecs_role": 5}

# By default, blast radius has a value of 1.
blast_radius_for_type(type) = 1 {
	not blasts_radii_by_type[type]
}

blast_radius_for_type(type) = ret {
	blasts_radii_by_type[type] = ret
}

In the above example, you can see the different resource types in “blasts_radii_by_type” and their multipliers:

“aws_ecs_cluster”: 20

“aws_ecs_user”: 10

“aws_ecs_role”: 5

The deny rule states that if “blast_radius_too_high” fires, the run is denied. The current threshold is set at > 100. This obviously gets much more complicated as you start working with a significant amount of infrastructure, but it’s a great starting point.

6. Modules 

By limiting infrastructure developers’ ability to deploy arbitrary resources and attributes, you help constrain the blast radius and solidify the security posture you require. By coupling strict linting and code-scanning rules with a module-based policy, you can ensure only the right settings get deployed. This also helps with cost control, because you can ensure that only resources within your budgetary constraints are available in the modules.

Here is an example of a module-oriented structure:

  1. Modules are created by the infrastructure team responsible for that module — networking, security, compute, etc. 
  2. Condition Checks are added to ensure proper constraints (see the sketch after this list). 
  3. Pre-commit hooks lint and scan for any vulnerabilities using tools such as tfsec or checkov. 
  4. Modules are committed to VCS, where they are tested or consumed by a module registry.
  5. If you are using a platform that has a private module registry with built-in testing, those tests can be run on the modules to ensure they operate correctly. 
  6. Policies are written to ensure deployments don’t break any constraints. Open Policy Agent is great for this. Terraform Cloud also offers Sentinel. 
  7. Engineers are given access to these modules as their roles dictate. All of the guardrails are in place, and they are unable to configure parameters outside of those boundaries without triggering policies in place. 
  8. Code is deployed, and final compliance tests are performed within the environment to ensure everything is still up to standards. This could use services such as AWS CloudTrail or GCP Cloud Audit Logs, for example.
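
To make step 2 a little more concrete, here is a minimal sketch of the kind of condition check a module might enforce; the variable names and allowed values are purely illustrative:

variable "instance_type" {
  description = "EC2 instance type consumers of this module may request"
  type        = string
  default     = "t3.small"

  validation {
    condition     = contains(["t3.small", "t3.medium", "t3.large"], var.instance_type)
    error_message = "Only t3.small, t3.medium, and t3.large are approved for this module."
  }
}

variable "volume_encrypted" {
  description = "Whether the root volume is encrypted"
  type        = bool
  default     = true

  validation {
    condition     = var.volume_encrypted
    error_message = "Unencrypted root volumes are not allowed."
  }
}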

7. RBAC

Role-Based Access Control (RBAC) is a very important aspect of your security posture. Kubernetes has native RBAC that is extremely useful when coupled with Namespaces. Terraform doesn’t exactly have an RBAC system, but you can use a module-based structure as we discussed previously to help ensure standards are maintained.
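
As a rough sketch of what namespace-scoped RBAC can look like when managed alongside the rest of your IaC, the snippet below defines a read-only role using Terraform’s kubernetes provider; the namespace and group names are invented:

# Read-only access to pods in the "dev" namespace, granted to a hypothetical "dev-team" group.
resource "kubernetes_role" "pod_reader" {
  metadata {
    name      = "pod-reader"
    namespace = "dev"
  }

  rule {
    api_groups = [""]
    resources  = ["pods"]
    verbs      = ["get", "list", "watch"]
  }
}

resource "kubernetes_role_binding" "pod_reader" {
  metadata {
    name      = "pod-reader-binding"
    namespace = "dev"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.pod_reader.metadata[0].name
  }

  subject {
    kind      = "Group"
    name      = "dev-team"
    api_group = "rbac.authorization.k8s.io"
  }
}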

8. Secret management

Secret management is important whether you’re a two-person startup or a 200,000-person corporation. Protecting your secrets is a foundational concept that doesn’t require much explanation.

Verifying that secrets never enter your repository to begin with is a more crucial element as your codebase scales. With a few thousand lines of code in a monorepo, it can be fairly straightforward to manage your secrets, but once you get into millions of lines of code, things get complicated.

Ensure you’re using code-scanning tools, encryption utilities, and anything else at your disposal to help prevent this. AWS Secrets Manager, Azure Key Vault, and HashiCorp Vault are some applications that can assist here. And, of course, if you do have to use static credentials instead of the recommended short-lived ones, ensure they are rotated frequently.
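
As a rough example of keeping static values out of the code, the sketch below pulls a credential from AWS Secrets Manager at runtime; the secret name and database settings are hypothetical, and remember that the fetched value will still end up in your Terraform state:

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/app/db-password" # hypothetical secret name
}

resource "aws_db_instance" "app" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app"
  password          = data.aws_secretsmanager_secret_version.db_password.secret_string
}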


9. Shareable variables or files 

As you build scalable infrastructure, you are likely to have many CI/CD pipelines running in the background. These pipelines need to get inputs from somewhere, and if you don’t leverage an approach that takes advantage of shareable variables or files, you will probably end up with repetitive code, or at least with duplication of these variables or files.

Shared variables are significant from both an infrastructure and an application standpoint. For infrastructure, the first thing that comes to mind is authentication to your cloud provider. Given that you are likely to have multiple automations that need to interact with your cloud infrastructure in some way, you would otherwise need to set up credentials for each of them. Shared variables make this easier: you set up the variables just once and then consume them in everything you are building. These variables should be secured, factoring in everything mentioned under secret management and RBAC, before they are implemented.

In other cases, you can use shared variables to simply speed up the process of sharing data among multiple points inside your workflow. Many available tools can help with this type of action, and most of them have some sort of secret-management mechanism embedded in them.
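
One simple pattern, sketched below with hypothetical parameter names, is to publish a shared value once to AWS SSM Parameter Store and let every other stack or pipeline read it from there:

# Publishing side: define the shared value once.
resource "aws_ssm_parameter" "shared_vpc_id" {
  name  = "/shared/network/vpc-id"
  type  = "String"            # use "SecureString" for KMS-encrypted secret values
  value = "vpc-0abc123def456" # e.g. an output from the networking stack
}

# Consuming side: any other stack or automation reads the same value instead of duplicating it.
data "aws_ssm_parameter" "shared_vpc_id" {
  name = "/shared/network/vpc-id"
}

# Referenced elsewhere as: data.aws_ssm_parameter.shared_vpc_id.value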

Sharing files is crucial for various big data applications. Imagine you have many IoT devices writing data to JSON files, hundreds of gigabytes daily. The data from these files needs to be easily accessible to some of your automations so it can be manipulated into whatever you need inside your application. This would be almost impossible without the ability to share these files inside a distributed system, because the time required to extract the data in a meaningful way would be much longer than the time required to collect it. Most cloud providers offer solutions for this kind of exercise, and there are also open source alternatives you can install and manage yourself if you want to build something in-house.

Sharing variables and files between nodes, processes, and tools speeds up your deployments, reduces the potential for human error, and makes your systems more efficient and robust.

10. Deploying resources

The actual deployment of most resources should be an automated process. With continuous delivery, deployment is automated, eliminating virtually all of the complexity. Manual approvals require considerable human intervention, making the process more complicated and slowing it to a crawl; requiring them should be the exception, not the rule. You can use the strategies we’ve outlined (blast radius scoring, OPA policies, and so on) to make intelligent decisions on when to involve manual processes.

Another step you can take during deployments is to test the deployment in another environment. Deploying to a test or staging environment before deploying to production is an integral part of the deployment process. Managing this pipeline, catching edge cases, and keeping it as hands-off as possible is crucial when dealing with a large number of deployments daily. 

Some organizations may deploy to a test environment first by pushing changes to a dev branch, testing fully, and then running a merge with prod, which deploys again using a different context. 

Others may want to push straight to a prod branch, have the pipeline run the plan, and then deploy. The difference here is that the code merge actually happens after the deployment has been made. This is sometimes referred to as a “pre-merge apply” and is commonly used within the Atlantis open source IaC deployment tool. This strategy allows the main branch to remain clean if issues are encountered. You need to be cautious about your blast radius here because incidents can happen even if the plans are usually reliable.
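
Whichever branching model you choose, the same code typically deploys with a different context per environment. Here is a minimal sketch of what that context switch can look like in Terraform, with illustrative workspace names and settings:

# Per-environment settings selected by workspace; names and values are illustrative.
locals {
  env = terraform.workspace # e.g. "dev" or "prod"

  settings = {
    dev  = { instance_type = "t3.small", instance_count = 1 }
    prod = { instance_type = "m5.large", instance_count = 3 }
  }

  current = local.settings[local.env]
}

The pipeline selects the workspace before it runs, so the test deployment exercises the same code with a smaller footprint than production.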

Where next?

Looking at the issues we’ve discussed here, it’s easy to see just how unwieldy and error-prone infrastructure deployments can become. Entire organizations suffer as technical debt mounts and developers resort to workarounds and emergency patches to get their jobs done rather than opening a ticket for every little thing they need to deploy. Things can quickly start to collapse.

By following the guidance in this article and taking the time to really map out your processes while engaging all teams and stakeholders involved, you will be able to scale your infrastructure deployments as far as your ambition requires. In future articles, we’ll go into greater depth, examining various aspects of managing infrastructure in more detail.