
5 Terraform Anti-Patterns I See in Every Startup Codebase

Simon Doba·April 15, 2026·7 min read

I onboarded a new client last quarter and asked to see their Terraform. They sent me a single file. 1,847 lines. One main.tf containing their entire AWS infrastructure: VPC, subnets, security groups, RDS, ECS cluster, ALB, Route53, IAM roles, S3 buckets, CloudFront distribution. All of it. One file. No modules. No variables for anything that changes between environments. Their staging and production configs were two copies of this file with hardcoded values changed by hand.

They are not unusual. I see some version of this at about four out of five startups that ask me for help with their infrastructure.

Here are the five patterns that come up every time, ranked by how much damage they cause.

One: everything in a single state file

This is the one that will eventually cause an outage.

Terraform stores the current state of your infrastructure in a state file. When you run terraform apply, it compares the desired state (your .tf files) against the current state (the state file) and figures out what to change. If all your infrastructure lives in one state file, every apply operation locks and evaluates everything. A change to a DNS record requires Terraform to also evaluate your database, your compute cluster, your networking layer.

At small scale this is fine. At 200 resources it starts to get slow. At 500 it takes minutes. At 1,000 you are waiting 8 to 12 minutes for a plan, and any failure during that plan leaves the state lock held until someone manually releases it.

The bigger problem is blast radius. A typo in a security group rule, applied against a monolithic state, can theoretically touch your database. I have seen exactly this happen: an engineer changed an ingress rule, Terraform's dependency graph pulled in an RDS modification that had been sitting in the diff unnoticed, and staging went down for three hours.

Split your state by layer. Networking in one state. Data stores in another. Compute in a third. DNS and CDN in a fourth. Each one can be planned and applied independently. The blast radius of any single apply shrinks from "everything" to "one layer."
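A minimal sketch of what layering looks like in practice. The directory names, the bucket name, and the private_subnet_ids output are all assumptions for illustration; the point is that a downstream layer reads an upstream layer's outputs through terraform_remote_state instead of sharing its state file:

```hcl
# Hypothetical layout: each layer is its own root module with its own state.
#   infra/
#     networking/   # VPC, subnets, security groups
#     data/         # RDS, S3
#     compute/      # ECS cluster, ALB
#     edge/         # Route53, CloudFront

# compute/main.tf
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state" # assumed bucket name
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_ecs_cluster" "main" {
  name = "app"
}

# Reference the networking layer's exported subnet IDs. Assumes
# networking/outputs.tf declares `output "private_subnet_ids"`.
locals {
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

An apply in compute/ now plans only the compute resources; the networking layer is read-only from its perspective.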

Two: no remote state backend

The terraform.tfstate file defaults to local storage. On the engineer's laptop. In the project directory. Sometimes committed to Git (which is its own category of problem, since state files contain secrets in plaintext).

When two engineers run terraform apply against a local state file, the second one overwrites the first one's changes. This is not a theoretical risk. I have watched it happen at a company where two people both applied against production on the same morning, each with a different local state, and the resulting infrastructure was a blend of both that matched neither.

Remote state with locking solves this completely. An S3 bucket with a DynamoDB table for locking, or Terraform Cloud, or any other supported backend. The state lives in one place, and only one person can apply at a time. Setup takes about 30 minutes.
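The S3-plus-DynamoDB setup is a single backend block per root module. Bucket and table names here are assumptions; both must exist before you run terraform init, and the DynamoDB table needs a string partition key named LockID:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state" # assumed bucket name
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # table with a "LockID" string hash key
  }
}
```

With this in place, a second concurrent apply fails fast with a lock error instead of silently overwriting state.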

Three: no modules

The client with the 1,847-line file was creating ECS services by copying and pasting 60 lines of resource blocks and changing the service name, container image, and port. They had seven services. That is 420 lines of nearly identical Terraform that drifts every time someone updates one service and forgets to update the others.

Modules exist for this. A module is a reusable Terraform component with input variables and output values. Define your ECS service pattern once as a module, then call it seven times with different parameters. When the pattern needs to change, you change the module, and all seven services update together.

I am not suggesting you build an internal module library on day one. That is over-engineering. But when you copy and paste the same resource block for the third time, it is time to extract a module. The threshold is three. Two duplicates are fine. Three means you need abstraction.
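Sketched with assumed names (the modules/ecs-service path and the ECR image URLs are hypothetical), the seven copy-pasted services collapse into one module called seven times:

```hcl
# modules/ecs-service/variables.tf -- the module's input surface
variable "name"  { type = string }
variable "image" { type = string }
variable "port"  { type = number }
# The task definition, service, and target group resources live inside
# the module, written once.

# Root module: one call per service instead of 60 pasted lines each.
module "api" {
  source = "./modules/ecs-service"
  name   = "api"
  image  = "123456789.dkr.ecr.us-east-1.amazonaws.com/api:latest"
  port   = 8080
}

module "worker" {
  source = "./modules/ecs-service"
  name   = "worker"
  image  = "123456789.dkr.ecr.us-east-1.amazonaws.com/worker:latest"
  port   = 9090
}
```

When the service pattern changes, the change lands in the module and every caller picks it up on the next plan.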

Four: hardcoded values everywhere

Environments differ. Staging has a smaller database instance, fewer container replicas, a different domain. Production has the big instance, more replicas, the real domain. When these differences are hardcoded into the Terraform files, creating a new environment means reading through every file and finding every value that needs to change.

I have seen engineers spin up a new staging environment from a production config and accidentally provision an r6g.2xlarge RDS instance that cost $400 per day for a staging database with 50 rows of test data. It ran for eleven days before anyone noticed.

Variables and .tfvars files. Every value that differs between environments should be a variable with a sensible default. The production-specific values go in prod.tfvars, staging in staging.tfvars. Creating a new environment becomes terraform apply -var-file=newenv.tfvars instead of a find-and-replace across 2,000 lines of HCL.
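A minimal sketch of the split. Variable names and values here are illustrative, not prescriptive; the shape is what matters: safe defaults in variables.tf, environment overrides in the .tfvars file:

```hcl
# variables.tf
variable "db_instance_class" {
  type    = string
  default = "db.t4g.small" # safe default; prod explicitly overrides it
}

variable "replica_count" {
  type    = number
  default = 1
}

variable "domain" {
  type = string
}

# prod.tfvars
# db_instance_class = "db.r6g.2xlarge"
# replica_count     = 4
# domain            = "example.com"
```

A forgotten override now gets you an undersized staging database, not a $400-a-day production instance serving test data.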

Five: no drift detection

Terraform assumes it is the only thing managing your infrastructure. When someone makes a change through the AWS console (and someone always does, usually during an incident at 2 AM), Terraform does not know about it. The state file says one thing, reality says another.

This is called drift. Left undetected, it accumulates. The next terraform apply will try to reconcile the state file with your .tf files, but it does not know about the manual console changes, so it might revert them. I have seen a manually added security group rule (added to fix a production issue) get silently removed by a Terraform apply three days later, re-breaking the thing it fixed.

Run terraform plan on a schedule. A weekly cron job that plans against production without applying, and sends the output somewhere visible. If the plan shows changes you did not make, someone clicked something in the console. Investigate before applying.

Some teams use Terraform Cloud's drift detection for this. Others use a simple GitHub Action that runs terraform plan nightly and posts the output to Slack. Either works. The important thing is that someone is looking.
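The GitHub Action version can be small. This is a sketch, not a drop-in: it assumes AWS credentials and a SLACK_WEBHOOK_URL secret are already configured on the repo, and that production lives in a prod/ directory. The useful flag is -detailed-exitcode, which makes terraform plan exit 2 when there are pending changes (note this sketch treats a plan error, exit 1, the same as drift):

```yaml
# .github/workflows/drift.yml (assumed path)
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *" # nightly

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Plan against production
        id: plan
        working-directory: prod
        run: |
          terraform init -input=false
          # -detailed-exitcode: 0 = clean, 2 = drift (1 = error)
          terraform plan -input=false -no-color -detailed-exitcode \
            || echo "drift=true" >> "$GITHUB_OUTPUT"
      - name: Alert Slack on drift
        if: steps.plan.outputs.drift == 'true'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          curl -X POST -H 'Content-Type: application/json' \
            --data '{"text": "Terraform drift detected in prod"}' \
            "$SLACK_WEBHOOK_URL"
```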

The common thread

All five of these are infrastructure management problems, not Terraform problems. Terraform is not opinionated enough to prevent them, so they happen by default at every company that grows faster than its infrastructure practices mature.

Fixing all five in an existing codebase takes one to two weeks of focused work. Preventing them in a new project takes about two hours of initial setup. The difference in ongoing maintenance cost is enormous.

Based on infrastructure audits across multiple startups and engineering teams.
