Why Terraform module size matters
After 6+ years working with Terraform in production, the most common issue I see isn't syntax errors or provider bugs — it's modules that have grown too large to maintain safely. A module that provisions a VPC, subnets, route tables, NAT gateways, security groups, and flow logs in a single unit might seem convenient at first. Six months later, a change to one security group rule requires a full plan across 47 resources.
The three failure patterns
1. The monolith module
Everything in one module — networking, compute, IAM, DNS. Changes are scary because blast radius is huge. State files grow to thousands of resources. Plans take 10+ minutes.
2. The deeply nested module
Modules calling modules calling modules, 4 levels deep. Debugging a variable that isn't propagating correctly becomes archaeology. terraform graph produces something that looks like a circuit board.
3. The copy-paste module
The same VPC configuration duplicated across 6 environment directories with minor variations. One security fix requires 6 separate PRs. This is how drift starts.
The structure that actually works
Here's the folder structure I use across production environments managing 100+ servers:
modules/
  networking/
    vpc/               # VPC + subnets only
    security-groups/
    transit-gateway/
  compute/
    ec2-asg/
    eks-cluster/
    ecs-service/
  storage/
    s3-bucket/
    rds-postgres/
  iam/
    role/
    policy/
environments/
  dev/
    main.tf            # calls modules with dev vars
    variables.tf
    terraform.tfvars
  staging/
  prod/
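To make the layout concrete, here is a sketch of what an environment directory's main.tf might look like when it composes these modules. The specific variable names and the cross-module wiring (passing the VPC ID into the security-groups module) are illustrative assumptions, not prescribed by the tree above:

```hcl
# environments/dev/main.tf — illustrative sketch; variable names and the
# security-groups module's inputs are assumptions for this example
module "vpc" {
  source = "../../modules/networking/vpc"

  vpc_cidr    = "10.10.0.0/16"
  environment = "dev"
}

module "security_groups" {
  source = "../../modules/networking/security-groups"

  # assumes the vpc module exports a vpc_id output
  vpc_id      = module.vpc.vpc_id
  environment = "dev"
}
```

Because each environment directory only wires modules together, a security fix lands once in the module and reaches every environment on its next plan, instead of requiring one PR per directory.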
Each module does one thing. The VPC module creates a VPC and subnets. It does not create security groups. It does not create route tables beyond what a VPC requires. If you find yourself adding an enable_nat_gateway boolean that spawns 6 different resource types, that's a sign to split.
Module interface design
The inputs and outputs of a module are its contract. Keep them minimal and explicit:
# Good — explicit, minimal
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for the VPC"
}

variable "environment" {
  type        = string
  description = "Environment name (dev/staging/prod)"
}

# Avoid — a map of everything
variable "config" {
  type = map(any) # This is a trap
}
When a module accepts a map(any), you lose type checking, validation, and documentation in one move. Future you will not thank past you.
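One concrete thing explicit types buy you is a validation block, which rejects bad input at plan time instead of at apply time. A sketch, assuming the three environment names used earlier are the only allowed values:

```hcl
variable "environment" {
  type        = string
  description = "Environment name (dev/staging/prod)"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```

A map(any) cannot carry a validation like this for its individual keys, so typos in a nested key surface as a provider error mid-apply rather than a clear message at plan.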
State isolation strategy
Each layer gets its own state file. Networking state is separate from compute state. This means:
- A broken compute deployment cannot corrupt your VPC state
- Teams can work on different layers simultaneously without state lock conflicts
- You can destroy and recreate compute without touching networking
# backend.tf per environment layer
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "prod/networking/vpc/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
  }
}
The refactoring approach
If you're sitting on a monolith module right now, don't try to refactor everything at once. The approach that works without breaking production:
- Identify the largest, most-changed resource group in the module
- Extract it into a new module with terraform state mv — no destroy/recreate
- Update references, run plan, verify zero infrastructure changes
- Merge, repeat with the next group
Run terraform plan after a state mv and before applying. A clean plan with zero changes confirms the refactor is safe; any unexpected changes mean something went wrong with the state migration.
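The extract-and-verify step sketched as CLI commands; the resource addresses here are hypothetical, so substitute the ones terraform state list shows you:

```
# Move a resource from the monolith's address to the new module's address.
# Addresses are hypothetical — find yours with `terraform state list`.
terraform state mv \
  'aws_security_group.app' \
  'module.security_groups.aws_security_group.app'

# Verify: -detailed-exitcode returns 0 when the plan has no pending changes,
# 2 when changes are pending — useful as a CI gate for the refactor
terraform plan -detailed-exitcode
```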
Final checklist before merging a module
- Does it do exactly one thing?
- Are all variables typed and described?
- Are outputs minimal — only what callers actually need?
- Does it have its own state backend configured?
- Would a new team member understand it without asking anyone?
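For the outputs question, "minimal" usually means the IDs the next layer consumes and nothing else. A sketch for the VPC module; the internal resource names (aws_vpc.this, aws_subnet.private) are illustrative assumptions:

```hcl
# modules/networking/vpc/outputs.tf — expose only what callers need
output "vpc_id" {
  value       = aws_vpc.this.id
  description = "ID of the created VPC"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "IDs of the private subnets"
}
```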
If you can answer yes to all five, ship it.