Terraform for scale-ups

Introduction

I've been building systems for almost 15 years, which is crazy to think about. I've also been building infrastructure (almost exclusively IaC) for around 7 years. I started with Kubernetes running API-driven microservices on AWS, then quickly moved to cloud-native architectures using Terraform, and for the past 5 years I've been on the serverless bandwagon. In this post, I'll share my experience with Terraform and how it has helped me scale up my infrastructure as a developer.

"what is Terraform"

I've seen so many posts that start this way, "what is Terraform", and honestly it's off-putting, so let me be brief: it's a way of codifying your cloud infrastructure. You write your infrastructure as code (IaC) in a declarative way, and Terraform takes care of provisioning and managing the resources for you. It's like having a blueprint for your cloud infrastructure that you can version control, share, and reuse.

Let's paint the picture

You're a founding engineer at a startup, you've been building the product for a few months, and now you're ready to scale up. You've got a few thousand users and you're starting to see some traction. Now the shift has started: your company is bringing on new engineers, the team is no longer 5 people but 50, and you're moving away from an MVP/early product to a scalable solution. You've got resources scattered across different accounts, regions, and services. You're using a mix of manual provisioning, scripts, and ad-hoc solutions. It's time to get serious about your infrastructure and make sure it can grow at the same speed as the business.

Decision time

We are going to be building a lot of new functionality onto the system, and we want a safe, easy to provision, repeatable, and secure way of building it. First we'll need to make a strategic architecture decision for the business. I won't go into detail about this decision, but make sure it is one that can scale with the business. For example, if you're building a modern reactive web application that is data driven, with requirements such as auditability, traceability, and scalability (all the -ilities), you might want to check out a previous post I wrote: An Introduction to Event Sourcing

Once we have our architecture, it's time to decide on our Terraform architecture, or better described, our Terraform design pattern. I don't see many people talking about this decision, but it's a critical one. We need to decide how we want to structure our Terraform code, how we want to manage our state, and how we want to deploy our infrastructure. There are many ways to do this, but I'll share my experience with a few different approaches. Let's go over the things I usually consider.

State Management

If you've worked with Terraform you'll already be using remote state with locking, so state management in this context is more about how we'll split and manage our state files. I usually start with one monolithic state per environment, which is a good starting point. It's important to note that although the state is monolithic, our IaC won't be, which brings us to the next point.

A typical monolithic remote state backend configuration:

terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

resource "aws_s3_bucket" "main" {
  bucket = "monolithic-bucket"
}

As your infrastructure grows, splitting state by domain or environment becomes essential:

terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "eu-west-1"
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}
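
Once state is split, one state often needs values owned by another. A minimal sketch of how an application state could read the VPC ID from the network state above using the terraform_remote_state data source. The security group and the vpc_id output name are assumptions for illustration; the network configuration would need to declare that output itself.

```hcl
# Hypothetical consumer configuration, stored under its own state key
# (for example "prod/app/terraform.tfstate").
# Assumes the network configuration declares:
#   output "vpc_id" { value = aws_vpc.main.id }
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Illustrative resource that depends on the VPC owned by the network state.
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

Only root-level outputs of the network state are visible this way, which is one more reason to be deliberate about what each state exposes.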

Ownership

Ownership is a key aspect of scaling up your infrastructure. As your team grows, you'll want to ensure that different teams or individuals can own specific parts of the infrastructure. This means splitting your Terraform code into smaller, manageable modules that can be owned by different teams. To increase visibility and understanding of what a service is, have each service owner also own the infrastructure for that service. For example, if you're building a web application, you might have a team responsible for the frontend, another team responsible for the backend, and another team responsible for the database. Each team can own their own Terraform modules and manage their own state files.

A simple example of a component module (for an S3 bucket):

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

variable "bucket_name" {
  description = "The name of the S3 bucket"
  type        = string
}

A block module might compose several components, such as an API Gateway, Lambda, and DynamoDB table:

module "api_gateway" {
  source = "../api-gateway"
  name   = var.api_name
}

module "lambda" {
  source        = "../lambda"
  function_name = var.lambda_name
}

module "dynamodb" {
  source     = "../dynamodb"
  table_name = var.table_name
}

A service module brings blocks together to represent a business domain:

module "web_api" {
  source      = "../../modules/web-api"
  api_name    = "my-web-api"
  lambda_name = "my-lambda"
  table_name  = "my-table"
}

Modularity

Modules are obviously critical: they allow you to encapsulate and reuse your infrastructure code. Let's talk about the three types of modules I usually start with:

Components: Closely tied to the resources they provision, these are the building blocks of your infrastructure. They are usually small, reusable pieces of code that provision specific resources. For example, you might have a component module for provisioning an S3 bucket, another for provisioning a DynamoDB table, and another for provisioning an IAM role.

Blocks: Blocks are more usable from an engineering point of view. They usually fall into two categories: a collection of components that work together to provide a specific functionality, or a domain-specific module that provides a specific functionality. An example of a block could be a web-api block that contains several components such as an API Gateway, Lambda function, and DynamoDB table. Blocks are more focused on the functionality they provide than on the resources they provision.

Services: The final type of module in this architecture is services. These modules are tightly coupled to the domain of the system and are composed exclusively of blocks, which makes it easy to identify which parts of the system are defined where. An example of a service module could be a web application service that uses a web-api block.

So in summary: components are tightly coupled to the resources of a provider. Blocks are collections of components (or domain-specific modules) that provide a specific functionality. Finally, services are collections of blocks that provide a specific domain functionality.

The following diagram illustrates how modules are composed in this approach:

graph LR
%%is-centered
  Components --> Blocks
  Blocks --> Services

Diagram illustrating the composition of modules

I want to dive more into the Terraform design pattern choices here, but let me first point out some standards that I follow, and have followed for quite some time.

Some standards

Some years ago I came up with some internal standards at a company I worked with, and I've pretty much been following them since. Google has also released a Terraform best practices document that I highly recommend reading; it closely aligns with what I push. Here are some of the key points I follow:

Naming Conventions

Use lowercase for resource names
Use underscores to separate words: resource_name
Be descriptive
But most importantly - make sure you establish a naming convention within your setup that is followed.

A typical naming convention for AWS resources in Terraform:

resource "aws_iam_role" "app_server_role" {
  name = "app_server_role"
  # ...
}

Variables

Variables should live inside a variables.tf file
Empty defaults should only be used when the underlying infrastructure can function with them - treat these as optional parameters.
More variables equals more complexity - consider your team expertise when building modules and exposing variables.
Always give variables types and descriptions
Validation is important but should be used sparingly - don't overcomplicate your modules.

Example of a well-documented variable in a module:

variable "bucket_name" {
  description = "The name of the S3 bucket"
  type        = string
}
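
To illustrate the points on sparing validation and empty defaults, here is a minimal sketch. The variable names and the allowed environment values are assumptions for illustration, not part of any particular module.

```hcl
# Hypothetical required input with a single, cheap validation rule.
variable "environment" {
  description = "Deployment environment the module is provisioned into"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment must be one of: dev, staging, prod."
  }
}

# Hypothetical optional input: the empty default is safe because the
# resources still work when no extra tags are supplied.
variable "tags" {
  description = "Optional extra tags applied to all resources in the module"
  type        = map(string)
  default     = {}
}
```

One rule that catches a likely typo at plan time is usually worth it; a wall of validation blocks tends to make a module harder to read than the mistakes it prevents.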

Outputs

Outputs should live inside an outputs.tf file
Outputs should be descriptive and provide useful information about the resources being provisioned.
Output most values of the resources you provision; this lets other modules consume them later without having to refactor the module.
Mind the dependency graph: avoid re-exporting input variables as outputs for other modules to consume. Reference the outputs of the module or resource directly instead, so Terraform can infer the dependency. This keeps the dependency graph clean and avoids circular dependencies.

Example output definition:

output "bucket_arn" {
  description = "The ARN of the S3 bucket"
  value       = aws_s3_bucket.this.arn
}
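
The dependency graph point is easier to see in a block module. A minimal sketch, assuming the dynamodb component exposes a table_arn output and the lambda component accepts a table_arn input (both names are hypothetical):

```hcl
# Hypothetical block module wiring one component's output into another.
module "dynamodb" {
  source     = "../dynamodb"
  table_name = var.table_name
}

module "lambda" {
  source        = "../lambda"
  function_name = var.lambda_name

  # Referencing the module output gives Terraform an explicit edge in the
  # dependency graph, so the table is created before the function.
  # Rebuilding the ARN from var.table_name would hide that dependency.
  table_arn = module.dynamodb.table_arn
}
```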

Module Documentation

It's highly recommended to have a README.md file in each module and to automate the generation of this file. Check out terraform-docs.

Data

Keep data sources in a data.tf file. This isn't a critical one, as there are times when you might only need one data object, and a case can be made for skipping the data.tf file then.

State management

Use remote state management to avoid conflicts and ensure consistency across environments.
Use state locking to prevent concurrent modifications to the state file.
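
As a sketch of both points together, here is an S3 backend with encryption and locking enabled. It assumes Terraform 1.10 or later, where the S3 backend supports the use_lockfile argument; on older versions locking is typically configured with the dynamodb_table argument pointing at a lock table instead.

```hcl
terraform {
  backend "s3" {
    bucket  = "my-terraform-state"
    key     = "prod/terraform.tfstate"
    region  = "eu-west-1"
    encrypt = true

    # S3-native state locking (assumes Terraform 1.10 or later).
    use_lockfile = true
  }
}
```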

Be declarative

This one is required for me. I see a lot of people using imperative code in their Terraform modules, and it's a big no-no: Terraform is a declarative language and should be used as such. Avoid using loops, conditionals, and other imperative constructs in your modules. Instead, use the built-in functions and resources to achieve the desired outcome. We're not trying to write a program here, we're trying to define infrastructure, so the key focus should be on how easy the code is to understand and maintain. Imperative code can make this difficult, so avoid it where possible.

A declarative resource definition is clear and maintainable:

resource "aws_s3_bucket" "example" {
  bucket = "my-declarative-bucket"
}

By contrast, imperative patterns (such as using null_resource and local-exec) should be avoided for core infrastructure logic:

resource "null_resource" "example" {
  provisioner "local-exec" {
    command = "aws s3 mb s3://my-bucket"
  }
}

Versioning

Use semantic versioning for your modules and services.
When publishing modules, use a version control system like Git to manage changes and track history.
Avoid using latest or master branches for production code, as this can lead to unexpected changes or can be misleading.
Use tags to mark specific versions of your modules and services, and use these tags in your Terraform code to ensure that you're using the correct version.
Ideally, avoid references that aren't descriptive, like a git commit hash. We've all been there, though, so if you do use a commit hash, make sure it's clear what the commit is for and what changes it includes. I've even pushed for linking the pull request or issue in the commit message to provide more context when debugging.

For example, reference a tagged module version rather than a branch or commit hash:

module "vpc" {
  source = "git::https://github.com/terraform-aws-modules/terraform-aws-vpc.git?ref=v3.0.0"
  name   = "my-vpc"
  cidr   = "10.0.0.0/16"
}
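
If the module is consumed from a registry rather than from Git, semantic versioning can be expressed as a version constraint. A minimal sketch using the public registry address of the same VPC module; the constraint shown is only an example, so pick whatever range suits your upgrade policy.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0" # accepts any 3.x release, never 4.0

  name = "my-vpc"
  cidr = "10.0.0.0/16"
}
```

The version argument only works for registry sources; Git sources are pinned through the ref query parameter as in the example above.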

Terraform Design Patterns

Terralith - A monolithic approach

Terralith is a design pattern that I avoid for the most part. It is a monolithic approach to managing your infrastructure: one large Terraform module contains all the resources and configuration for your entire infrastructure. To give it some credit, it can be useful for small teams or projects where you know the infrastructure will stay small and simple and won't need modules. However, as your team grows and your infrastructure becomes more complex, this approach becomes difficult to manage and maintain. I highly advise people to avoid this pattern.

graph TD
%%is-centered
  A[Monolithic State] --> B[All Resources]
  B --> C[Single Team]
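
As a rough sketch of what a terralith tends to look like in practice: one root module, one state, and every resource declared inline. The resource names and values are made up for illustration.

```hcl
# main.tf - hypothetical terralith: a single state holding everything.
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_s3_bucket" "assets" {
  bucket = "my-assets-bucket"
}

resource "aws_dynamodb_table" "orders" {
  name         = "orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "id"

  attribute {
    name = "id"
    type = "S"
  }
}
```

Every plan touches every resource, and every engineer works in the same files, which is where the pain starts as the team grows.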

Terramod - A modular approach

Terramod is a design pattern that I highly recommend for most projects. It is a modular approach to managing your infrastructure. The idea is to break down your infrastructure into smaller, reusable modules that can be easily managed and maintained. This approach allows you to have a clear separation of concerns, making it easier to understand and manage your infrastructure. It also allows you to reuse modules across different projects, reducing duplication and improving maintainability.

graph TD
%%is-centered
  A[Root Module] --> B[Component Module 1]
  A --> C[Component Module 2]
  A --> D[Component Module 3]
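
A minimal sketch of a Terramod root module, matching the diagram above. The module paths and input names are assumptions; the component modules follow the shape of the S3 bucket component shown earlier.

```hcl
# Hypothetical root module composed of reusable component modules.
module "assets_bucket" {
  source      = "./modules/s3-bucket"
  bucket_name = "my-assets-bucket"
}

module "orders_table" {
  source     = "./modules/dynamodb"
  table_name = "orders"
}

module "app_role" {
  source    = "./modules/iam-role"
  role_name = "app_server_role"
}
```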

Terraservice - A service-oriented approach

Terraservice is a design pattern that I usually go for when building infrastructure with many services and many teams. The idea is to break down your infrastructure into smaller, service-oriented modules that can be easily managed and maintained by the teams that own them. As with Terramod, you get a clear separation of concerns and modules you can reuse across projects, reducing duplication and improving maintainability. The difference is that Terraservice focuses more on the ownership aspect of the infrastructure.

graph TD
%%is-centered
  S1[Service 1] --> M1[Block A]
  S1 --> M2[Block B]
  S2[Service 2] --> M3[Block C]
  S2 --> M4[Block D]
  M2 --> C2[Component Module 2]
  M2 --> C3[Component Module 3]
  M3 --> C4[Component Module 4]
  M1 --> C4
  M3 --> C1[Component Module 1]
  M4 --> C1
  M1 --> C1
  M1 --> C2
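
A minimal sketch of one service in a Terraservice layout. It assumes each service lives in its own directory with its own state key and is composed of blocks such as the web-api block from earlier; the path, state key, and names are hypothetical.

```hcl
# services/orders/main.tf - hypothetical service owned by the orders team.
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/services/orders/terraform.tfstate"
    region = "eu-west-1"
  }
}

# The service is defined purely in terms of blocks.
module "orders_api" {
  source      = "../../modules/web-api"
  api_name    = "orders-api"
  lambda_name = "orders-handler"
  table_name  = "orders"
}
```

Because the state is scoped to the service, the owning team can plan and apply without touching anyone else's infrastructure.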

Conclusion

Overall, I think the biggest takeaways (the TL;DR) are:

Set internal standards
Avoid complexity when possible
Adopt a terraform design pattern that suits your team and business

Reading Recommendation

If you're looking for more information on Terraform and IaC, I highly recommend the following resources: