AWS · FinOps · Terraform · Sep 2024 · 10 min read

AWS Cost Optimization: How to Cut Cloud Spend by 20% Systematically

A systematic approach to reducing AWS costs — from tagging and right-sizing to S3 lifecycle policies and NAT Gateway audits. The same process that achieved 20% savings without touching production.

The uncomfortable truth about cloud costs

Most cloud cost problems aren't caused by expensive services — they're caused by resources nobody is watching. Over-provisioned EC2 instances that were "temporarily" sized up two years ago. S3 buckets accumulating 6 years of logs with no lifecycle policy. NAT Gateway charges nobody noticed because they're buried in the bill. Here's the systematic approach I used to achieve a 20% reduction in monthly AWS spend without touching a single production workload.

Step 1: Get visibility before you optimise

You can't cut what you can't see. Before making any changes, spend a week with AWS Cost Explorer properly configured.

# Enable Cost Allocation Tags first
# AWS Console → Billing → Cost Allocation Tags
# Activate: Environment, Service, Team, Owner

# Then filter Cost Explorer by tag to see spend per service/team

If your resources aren't tagged, tag them before doing anything else. Untagged resources are invisible costs. Set up a tag enforcement policy via AWS Config to catch new untagged resources automatically.
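Before wiring this into AWS Config, the core compliance check is simple enough to sketch in a few lines of Python. The required tag names mirror the ones activated above; the resource-dict shape and function names are illustrative assumptions, not an AWS API:

```python
# Sketch of a tag-compliance check. Resources are modelled as plain
# dicts of tags (an assumption for illustration, not an AWS API shape).

REQUIRED_TAGS = {"Environment", "Service", "Team", "Owner"}

def missing_tags(resource_tags):
    """Return the set of required tags a resource is missing."""
    return REQUIRED_TAGS - set(resource_tags)

def noncompliant(resources):
    """Map resource ID -> missing tags, for resources that fail the policy."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report
```

The same logic is what the `required-tags` managed AWS Config rule evaluates for you, continuously and across accounts.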

Step 2: Right-size EC2 instances

AWS Compute Optimizer analyses your CloudWatch metrics and tells you which instances are over-provisioned. Enable it — it's free for EC2.

# Enable Compute Optimizer via CLI
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Check recommendations after 14 days of metric collection
aws compute-optimizer get-ec2-instance-recommendations \
  --region us-east-1 \
  --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`]'

In practice I find 20–30% of instances running consistently at under 10% CPU utilisation. An m5.2xlarge at 8% CPU is an m5.large that's burning the difference.
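What that difference is worth is easy to put a number on. The on-demand rates below are illustrative us-east-1 Linux prices (check current pricing before relying on them):

```python
# Rough monthly saving from downsizing one instance, using
# illustrative us-east-1 Linux on-demand rates (assumptions).
HOURS_PER_MONTH = 730

on_demand = {            # $/hour
    "m5.2xlarge": 0.384,
    "m5.large":   0.096,
}

def monthly_saving(old, new):
    """Dollars saved per month by moving one instance from old to new."""
    return (on_demand[old] - on_demand[new]) * HOURS_PER_MONTH

# m5.2xlarge -> m5.large frees roughly $210/month, per instance
```

Multiply that by the 20–30% of the fleet Compute Optimizer flags and the number gets large fast.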

Step 3: S3 lifecycle policies

S3 is cheap per GB but it accumulates silently. Logs, backups, and artifacts from 3 years ago sitting in Standard storage cost real money at scale.

# Lifecycle policy via Terraform
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "log-lifecycle"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"    # ~45% cheaper than Standard
    }
    transition {
      days          = 90
      storage_class = "GLACIER_IR"     # ~68% cheaper than Standard-IA
    }
    expiration {
      days = 365                         # delete after 1 year
    }
  }
}
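A back-of-the-envelope estimate shows what the policy above is worth at scale. Per-GB-month prices are illustrative us-east-1 figures (assumptions; check current pricing):

```python
# Estimate monthly S3 cost before and after the lifecycle policy,
# using illustrative us-east-1 prices per GB-month (assumptions).
PRICE = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER_IR": 0.004}

def monthly_cost(gb_by_class):
    """Total monthly storage cost for GB held in each storage class."""
    return sum(PRICE[c] * gb for c, gb in gb_by_class.items())

# 10 TB of logs, all in Standard vs. spread across tiers once
# the 30/90-day transitions have caught up
before = monthly_cost({"STANDARD": 10_000})
after = monthly_cost({"STANDARD": 1_000, "STANDARD_IA": 2_000, "GLACIER_IR": 7_000})
```

For this (hypothetical) distribution the bill drops from about $230/month to about $76/month, before the one-year expiration deletes anything at all.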

Step 4: Kill idle NAT Gateways

NAT Gateways cost $0.045/hour plus data processing charges. An idle NAT Gateway in a dev environment costs ~$32/month. Multiply by 3 environments with 2 AZs each and you're paying $190/month for NAT Gateways in non-production environments doing almost nothing.
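The arithmetic behind those figures, as a quick sanity check (hourly charge only; the $0.045/GB data-processing charge comes on top):

```python
# Where the ~$190/month figure comes from: hourly charge only,
# data-processing charges excluded.
NAT_HOURLY = 0.045
HOURS_PER_MONTH = 730

per_gateway = NAT_HOURLY * HOURS_PER_MONTH    # ~$32.85/month per gateway
environments, azs = 3, 2
total = per_gateway * environments * azs      # ~$197/month across 6 gateways
```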

# Find NAT Gateways with low traffic via CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-xxxxxxxxx \
  --start-time 2024-08-01T00:00:00Z \
  --end-time 2024-09-01T00:00:00Z \
  --period 2592000 \
  --statistics Sum

For dev/staging environments that don't run overnight, a Lambda function that deletes NAT Gateways at 8pm and recreates them at 8am saves the hourly charge during off-hours. Bear in mind that a recreated gateway gets a new ID, so the Lambda also has to update the route tables that point at it.
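The schedule itself is the easy part; what it's worth per gateway is straightforward to quantify (assuming a 12-hour overnight window every day; extend it over weekends and the saving roughly doubles):

```python
# Value of the 8pm-8am off-hours schedule, per NAT Gateway,
# assuming a 12-hour window every day of a 30-day month.
NAT_HOURLY = 0.045
off_hours_per_day = 12

monthly_off_hours_saving = NAT_HOURLY * off_hours_per_day * 30   # ~$16.20/gateway
```

Small per gateway, but across the six non-production gateways above it halves the NAT line item.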

Step 5: Reserved Instances and Savings Plans

If you've been running the same instance types for 6+ months and plan to continue — buy Reserved Instances or Compute Savings Plans. This alone typically saves 30–40% on EC2 costs with zero configuration changes.

Do this last: Buy reservations only after right-sizing. Buying a 1-year reservation on an over-provisioned instance locks in waste for 12 months.
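A simple break-even check makes the "right-size first" rule concrete. The 35% discount below is an illustrative figure; actual discounts vary by term, payment option, and instance family:

```python
# Does reserving beat on-demand at a given utilisation level?
# You pay the reserved rate for 100% of the hours in the term,
# but on-demand only for the hours the instance actually runs.
def reserved_wins(on_demand_rate, discount, utilisation):
    reserved_cost = on_demand_rate * (1 - discount)   # paid every hour
    on_demand_cost = on_demand_rate * utilisation     # paid only while running
    return reserved_cost < on_demand_cost
```

At a 35% discount the reservation only wins above 65% utilisation, which is exactly why committing to an over-provisioned instance, or one you're about to downsize, locks in waste.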

Step 6: Automate with Lambda

# Lambda to stop non-prod instances on a schedule
import boto3

def handler(event, context):
    ec2 = boto3.client('ec2')

    # Find running instances tagged Environment=dev or Environment=staging
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    instance_ids = [
        i['InstanceId']
        for r in response['Reservations']
        for i in r['Instances']
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    # Return a result either way so the invocation log shows what happened
    return {'stopped': instance_ids}

The order that works

  1. Tag everything — get visibility first
  2. Right-size with Compute Optimizer recommendations
  3. Add S3 lifecycle policies to all log and backup buckets
  4. Audit NAT Gateways in non-production environments
  5. Schedule non-prod instances off overnight
  6. Buy Savings Plans for stable production workloads

In that order, about six weeks of work gets you to a 20% reduction without touching a single production workload.
