

How to not f-up your infrastructure when using Terraform

David Vuong



Avoid common mistakes that can royally screw up your infrastructure's state.

Terraform is fast becoming the most popular tool for writing infrastructure as code (IaC). The first concept you'll encounter when learning Terraform is state management. It's the most important subject, because if you mess it up you'll find yourself pulling your hair out trying to fix it.

In this post we’ll talk about the beast that is Terraform state management and best practices to follow to tame it.

What is Terraform state?

Before we jump into it, what is Terraform state and why is it needed? The state is the mechanism by which Terraform tracks changes made to your infrastructure. Every resource created, updated, or deleted is tracked and recorded via the state.

Terraform records state in a tfstate file. This file is used to derive the difference between what is currently provisioned and what you want to provision. Like other IaC tools such as Ansible, Terraform's configuration language HCL (HashiCorp Configuration Language) is declarative, and Terraform uses the tfstate as the starting position from which to reach your desired state.

Below is an example provisioning an S3 bucket, along with the generated tfstate:

terraform {
  required_version = ">= 0.13"
}

provider "aws" {
  version = "~> 3.0"
  region  = "ap-southeast-2"
}

resource "aws_s3_bucket" "b" {
  bucket = "bucket-name"
  acl    = "private"
  tags = {
    Region    = "ap-southeast-2"
    Terraform = 1
  }
}

/*
{
  "version": 4,
  "terraform_version": "0.13.0",
  "serial": 1,
  "lineage": "d3833fc2-45d6-4034-8a77-8c6635d70caf",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "aws_s3_bucket",
      "name": "b",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "acceleration_status": "",
            "acl": "private",
            "arn": "arn:aws:s3:::bucket-name",
            "bucket": "bucket-name",
            "bucket_domain_name": "bucket-name.s3.amazonaws.com",
            "bucket_prefix": null,
            "bucket_regional_domain_name": "bucket-name.s3.ap-southeast-2.amazonaws.com",
            "cors_rule": [],
            "force_destroy": false,
            "grant": [],
            "hosted_zone_id": "<redacted>",
            "id": "<redacted>",
            "lifecycle_rule": [],
            "logging": [],
            "object_lock_configuration": [],
            "policy": null,
            "region": "ap-southeast-2",
            "replication_configuration": [],
            "request_payer": "BucketOwner",
            "server_side_encryption_configuration": [],
            "tags": {
              "Region": "ap-southeast-2",
              "Terraform": "1"
            },
            "versioning": [
              {
                "enabled": false,
                "mfa_delete": false
              }
            ],
            "website": [],
            "website_domain": null,
            "website_endpoint": null
          },
          "private": "<redacted>"
        }
      ]
    }
  ]
}
*/

Terraform is declarative: if we make any changes to aws_s3_bucket.b, Terraform will compare the current state with the desired state and figure out how to get there. Under the hood, a dependency graph and a series of AWS API calls are derived to apply that change. You can see this by setting TF_LOG=debug before your apply command.
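For example, to watch those API calls during a run (TF_LOG and TF_LOG_PATH are standard Terraform environment variables):

```shell
# Log provider/API activity for a single apply.
TF_LOG=debug terraform apply

# Or redirect the (very noisy) logs to a file so the apply output stays readable.
TF_LOG=debug TF_LOG_PATH=./terraform-debug.log terraform apply
```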

Remote state management

A Terraform backend is a mechanism that describes where to store and load state. The default backend stores tfstate files on your local machine. This is reasonable when you’re prototyping but it immediately becomes a problem when you need guarantees around safety, or if you work in a team. 

Enter remote states.

When Terraform was in its infancy, there were only a few remote locations you could store state. There are now a variety of different storage locations. The most common is AWS S3. Here’s an example:

# backend.tf
bucket         = "studio.voltron.project.tfstate"
region         = "ap-southeast-2"
key            = "terraform.tfstate"
dynamodb_table = "terraform-state-locks"
profile        = "voltronstudio-master"

# Backends must be baked in during the initialization step:
#
# terraform init -backend-config=backend.tf

You can read more about the AWS S3 backend on their official website.
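Note that backend.tf above is a partial configuration: it only supplies values. The root module must still declare that it uses the s3 backend, typically with an empty block that init fills in from the -backend-config file:

```hcl
# main.tf (or any file in the root module)
terraform {
  # Settings (bucket, key, region, dynamodb_table, profile) are
  # injected at init time via -backend-config=backend.tf.
  backend "s3" {}
}
```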

The two key areas to look at are dynamodb_table and profile. These two lines open up two new problems that appear once remote state is enabled: how do you avoid tfstate file corruption from concurrent access, and where do you store tfstate files across environments?

State locks solve the concurrency problem. Locking the tfstate file prevents one engineer from writing to it while another is executing a change. The AWS S3 backend uses a DynamoDB table to acquire the lock atomically and avoid overlapping calls: if engineer A holds the lock, engineer B is blocked until A releases it.

Where to store tfstate files is subjective. One approach is for each environment to have its own S3 bucket and DynamoDB table. Each time you want to target a different environment, you reinitialize Terraform with that environment's backend. This works, but it's messy and unnecessary.

A better solution is to store them in one location. Create a separate AWS account specifically for storing tfstate files. These are the benefits:

  1. You’ll only need one S3 bucket and one DynamoDB table
  2. Syncing between remote and local state is easy as there’s only one location to sync with
  3. It minimises risk: the separate AWS account can be made accessible only to authorised engineers

Multiple state files per environment

A common mistake we see from newcomers is keeping a single state file per environment. This goes against best practice for one major reason: if you corrupt that tfstate file, your entire environment is broken. You can de-risk this by separating your infrastructure into logical components. If a tfstate file becomes corrupt or lost, only one component is damaged.

Grouping pieces of your infrastructure into logical components during the early stages of development is almost as hard as naming things, which is why most people keep everything in one state. However, there are a few rules of thumb we use to help separate them:

  1. Each microservice and its resource dependencies (e.g. a PostgreSQL database) should be placed in its own component
  2. Shared resources (e.g. VPC, EKS) should be placed in a shared component
  3. Components that may cause a circular dependency should be placed in a shared component
  4. Resources that are difficult to justify being in one component over another should be placed in a shared component. For example, component A inserts a message into SQS and component B reads the same queue: should the SQS definition belong to component A or B?

Here’s an example of what your project structure might look like:

terraform/
├── backend.tf
└── components
    ├── component_a
    │   ├── exports.tf
    │   └── main.tf
    ├── component_b
    │   ├── exports.tf
    │   └── main.tf
    └── shared
        ├── s3
        │   ├── exports.tf
        │   └── main.tf
        └── vpc
            ├── exports.tf
            └── main.tf

Terraform workspaces

Since Terraform v0.10, workspaces have been the de facto standard for handling environments. However, as the official docs suggest, it's not enough to just use workspaces to isolate state. If your environments are monolithic, a corrupt tfstate will still take down the entire environment, so it's better to combine workspaces with components.

Each component exists in each environment, and each workspace represents a component/environment pair.

Here’s an example of what terraform workspace list might look like when you combine components and environments together:

default
dev-shared
dev-component-a
dev-component-b
prod-shared
prod-component-a
prod-component-b
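Day-to-day, targeting one component in one environment then looks something like the following (paths follow the example project structure above):

```shell
cd terraform/components/component_a

# Create the workspace once per component/environment pair...
terraform workspace new dev-component-a

# ...then select it before planning or applying.
terraform workspace select dev-component-a
terraform plan
```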

Sharing state between components

State is isolated by design. However, sometimes you need to access another component's state.

You rarely need this, barring fundamental shared components such as VPCs, or resources used across components such as SQS. Terraform provides a mechanism for access: the terraform_remote_state data source. Here's an example:

data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "networking-terraform-state-files"
    key    = "vpc-prod.terraform.tfstate"
    region = "us-east-1"
  }
}

data.terraform_remote_state.vpc.outputs.private_subnet_ids

This works, but there are a few problems:

  • bucket, key, and region must be hard coded. Yes, you can pull them into variables, but they'll still be duplicated because they're already defined in your backend
  • private_subnet_ids is only available if it is declared as an output, and it's not always clear what outputs exist
  • Depending on how you've structured your remote state, you might accidentally pull state from a different environment
  • It feels wrong to reach into state files directly: extracting another module's internal details is error prone

If you’re using AWS, a better approach is to use SSM Param Store.

# In the shared component: publish the queue's id as an SSM parameter.
resource "aws_ssm_parameter" "sqs_http_proxy_id" {
  name      = "/voltron/shared/sqs/http-proxy/id"
  type      = "String"
  value     = module.sqs_http_proxy.id
  overwrite = true
}

# In a consuming component: read the parameter back.
data "aws_ssm_parameter" "sqs_http_proxy_id" {
  name = "/voltron/shared/sqs/http-proxy/id"
}

data.aws_ssm_parameter.sqs_http_proxy_id.value

The idea is that rather than using remote state outputs, you create an SSM parameter whose value is the output you want to share. There are benefits to doing this:

  • None of your remote state config is hard coded. You're accessing the value like any other resource in your environment's infrastructure
  • It's clear exactly what keys are available; you can log into the AWS console and view them all
  • You'll never accidentally load data from a different environment
  • Subjectively, it's a cleaner solution. If you think of Terraform as a program, with each resource a function and the tfstate files and HCL the implementation, then using SSM Param Store is akin to calling a function, whereas loading an arbitrary remote state file is like reflecting into a class and dynamically invoking its internals
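Putting the pattern together, a consuming component reads the parameter and feeds the value into its own resources. The Lambda worker below is purely illustrative (the function, role, and artifact names are assumptions, not from the original setup):

```hcl
# Look up the id published by the shared component.
data "aws_ssm_parameter" "http_proxy_id" {
  name = "/voltron/shared/sqs/http-proxy/id"
}

# Wire it into this component's own infrastructure, e.g. a worker that
# processes messages from the shared queue. All names here are hypothetical.
resource "aws_lambda_function" "worker" {
  function_name = "component-b-worker"
  role          = aws_iam_role.worker.arn # assumed to exist in this component
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  filename      = "build/worker.zip"

  environment {
    variables = {
      QUEUE_ID = data.aws_ssm_parameter.http_proxy_id.value
    }
  }
}
```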

Conclusion

This has been a brief introduction to the fundamentals of state management in Terraform. Remote state, state locking, component/environment workspace pairs, and SSM Param Store as a mechanism for state sharing are just the tip of the iceberg.

As you progress, you'll find yourself digging into more advanced topics such as explicit state manipulation, tainting resources to force a change, or importing existing resources. A strong understanding of the fundamentals lets you delve into those areas confident that you can recover if anything goes wrong.

At Voltron Studio, we're all about understanding the fundamentals before implementing a solution, whether that's solving a business problem or applying a tool such as Terraform. Too often we see beginners jump straight into writing code rather than taking the time to piece together the right approach. Reach out; we're happy to hear about your problem.

Keep calm and terraform.

Related tags: Software Development · Terraform · DevOps · AWS
Written by

David Vuong

Co-Founder & Director at Voltron Studio



© 2021 Voltron Studio Pty Ltd, ABN 72 645 265 103