Why Terraform module size matters
After 6+ years working with Terraform in production, the most common issue I see isn't syntax errors or provider bugs — it's modules that have grown too large to maintain safely. A module that provisions a VPC, subnets, route tables, NAT gateways, security groups, and flow logs in a single unit might seem convenient at first. Six months later, a change to one security group rule requires a full plan across 47 resources.
The three failure patterns
1. The monolith module
Everything in one module — networking, compute, IAM, DNS. Changes are scary because blast radius is huge. State files grow to thousands of resources. Plans take 10+ minutes.
2. The deeply nested module
Modules calling modules calling modules, 4 levels deep. Debugging a variable that isn't propagating correctly becomes archaeology. terraform graph produces something that looks like a circuit board.
3. The copy-paste module
The same VPC configuration duplicated across 6 environment directories with minor variations. One security fix requires 6 separate PRs. This is how drift starts.
The structure that actually works
Here's the folder structure I use across production environments managing 100+ servers:
modules/
  networking/
    vpc/               # VPC + subnets only
    security-groups/
    transit-gateway/
  compute/
    ec2-asg/
    eks-cluster/
    ecs-service/
  storage/
    s3-bucket/
    rds-postgres/
  iam/
    role/
    policy/
environments/
  dev/
    main.tf            # calls modules with dev vars
    variables.tf
    terraform.tfvars
  staging/
  prod/
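To make the layout concrete, here is a sketch of what an environment directory's main.tf might look like when it composes these modules. The specific variable names and the cross-module wiring (passing the VPC ID into the security-groups module) are illustrative assumptions, not prescribed by the tree above:

```hcl
# environments/dev/main.tf — illustrative sketch; variable names and the
# security-groups module's inputs are assumptions for this example
module "vpc" {
  source = "../../modules/networking/vpc"

  vpc_cidr    = "10.10.0.0/16"
  environment = "dev"
}

module "security_groups" {
  source = "../../modules/networking/security-groups"

  # assumes the vpc module exports a vpc_id output
  vpc_id      = module.vpc.vpc_id
  environment = "dev"
}
```

Because each environment directory only wires modules together, a security fix lands once in the module and reaches every environment on its next plan, instead of requiring one PR per directory.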
Each module does one thing. The VPC module creates a VPC and subnets. It does not create security groups. It does not create route tables beyond what a VPC requires. If you find yourself adding an enable_nat_gateway boolean that spawns 6 different resource types, that's a sign to split.
Module interface design
The inputs and outputs of a module are its contract. Keep them minimal and explicit:
# Good — explicit, minimal
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC"
}

variable "environment" {
  type        = string
  description = "Environment name (dev/staging/prod)"
}

# Avoid — a map of everything
variable "config" {
  type = map(any) # This is a trap
}
When a module accepts a map(any), you lose type checking, validation, and documentation in one move. Future you will not thank past you.
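One concrete thing explicit types buy you is a validation block, which rejects bad input at plan time instead of at apply time. A sketch, assuming the three environment names used earlier are the only allowed values:

```hcl
variable "environment" {
  type        = string
  description = "Environment name (dev/staging/prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```

A map(any) cannot carry a validation like this for its individual keys, so typos in a nested key surface as a provider error mid-apply rather than a clear message at plan.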
State isolation strategy
Each layer gets its own state file. Networking state is separate from compute state. This means:
- A broken compute deployment cannot corrupt your VPC state
- Teams can work on different layers simultaneously without state lock conflicts
- You can destroy and recreate compute without touching networking
# backend.tf per environment layer
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "prod/networking/vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
  }
}
The refactoring approach
If you're sitting on a monolith module right now, don't try to refactor everything at once. The approach that works without breaking production:
- Identify the largest, most-changed resource group in the module
- Extract it into a new module with terraform state mv — no destroy/recreate
- Update references, run plan, verify zero infrastructure changes
- Merge, repeat with the next group
Run terraform plan after a state mv and before applying. A clean plan with zero changes confirms the refactor is safe; any unexpected changes mean something went wrong with the state migration.
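The extract-and-verify step sketched as CLI commands; the resource addresses here are hypothetical, so substitute the ones terraform state list shows you:

```
# Move a resource from the monolith's address to the new module's address.
# Addresses are hypothetical — find yours with `terraform state list`.
terraform state mv \
  'aws_security_group.app' \
  'module.security_groups.aws_security_group.app'

# Verify: -detailed-exitcode returns 0 when the plan has no pending changes,
# 2 when changes are pending — useful as a CI gate for the refactor
terraform plan -detailed-exitcode
```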
Final checklist before merging a module
- Does it do exactly one thing?
- Are all variables typed and described?
- Are outputs minimal — only what callers actually need?
- Does it have its own state backend configured?
- Would a new team member understand it without asking anyone?
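For the outputs question, "minimal" usually means the IDs the next layer consumes and nothing else. A sketch for the VPC module; the internal resource names (aws_vpc.this, aws_subnet.private) are illustrative assumptions:

```hcl
# modules/networking/vpc/outputs.tf — expose only what callers need
output "vpc_id" {
  value       = aws_vpc.this.id
  description = "ID of the created VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "IDs of the private subnets"
}
```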
If you can answer yes to all five, ship it.