The uncomfortable truth about cloud costs
Most cloud cost problems aren't caused by expensive services — they're caused by resources nobody is watching. Over-provisioned EC2 instances that were "temporarily" sized up two years ago. S3 buckets accumulating 6 years of logs with no lifecycle policy. NAT Gateway charges nobody noticed because they're buried in the bill. Here's the systematic approach I used to achieve a 20% reduction in monthly AWS spend without touching a single production workload.
Step 1: Get visibility before you optimise
You can't cut what you can't see. Before making any changes, spend a week with AWS Cost Explorer properly configured.
# Enable Cost Allocation Tags first
# AWS Console → Billing → Cost Allocation Tags
# Activate: Environment, Service, Team, Owner
# Then filter Cost Explorer by tag to see spend per service/team
If your resources aren't tagged, tag them before doing anything else. Untagged resources are invisible costs. Set up tag enforcement with AWS Config (the managed required-tags rule flags resources that are missing the keys you specify) to catch new untagged resources automatically.
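To get a first list of what is untagged today, something like the following can help. It is a minimal sketch, assuming EC2 instances are the resource type you care about and that the four tag keys listed above are your required set. The helper names (`missing_tags`, `untagged_instances`) are mine, not part of any AWS SDK.

```python
from typing import Dict, Iterable, List

# Tag keys taken from the list above; adjust to your own tagging scheme.
REQUIRED_TAGS = ("Environment", "Service", "Team", "Owner")


def missing_tags(instance: dict, required: Iterable[str] = REQUIRED_TAGS) -> List[str]:
    """Return the required tag keys that are absent or empty on one instance."""
    present = {t["Key"]: t.get("Value", "") for t in instance.get("Tags", [])}
    return [key for key in required if not present.get(key, "").strip()]


def untagged_instances(pages: Iterable[dict]) -> Dict[str, List[str]]:
    """Map instance ID -> missing tag keys, for instances missing at least one."""
    report = {}
    for page in pages:
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                missing = missing_tags(instance)
                if missing:
                    report[instance["InstanceId"]] = missing
    return report


def main() -> None:
    import boto3  # imported here so the helpers above stay testable offline

    pages = boto3.client("ec2").get_paginator("describe_instances").paginate()
    for instance_id, missing in untagged_instances(pages).items():
        print(f"{instance_id}: missing {', '.join(missing)}")
```

Calling `main()` with credentials configured prints one line per instance that lacks a required tag. The same pattern should extend to other resource types, but each has its own describe call and response shape.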
Step 2: Right-size EC2 instances
AWS Compute Optimizer analyses your CloudWatch metrics and tells you which instances are over-provisioned. Enable it — it's free for EC2.
# Enable Compute Optimizer via CLI
aws compute-optimizer update-enrollment-status --status Active --include-member-accounts
# Check recommendations after 14 days of metric collection
aws compute-optimizer get-ec2-instance-recommendations --region us-east-1 --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`]'
In practice I find 20–30% of instances consistently running at under 10% CPU utilisation. An m5.2xlarge running at 8% CPU is an m5.large that sets money on fire.
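Once recommendations exist, a short script can turn the raw response into a worklist. This is a sketch over the response shape of `get_ec2_instance_recommendations` as I understand it (`instanceRecommendations`, `finding`, `currentInstanceType`, `recommendationOptions` with `instanceType` and `rank`). The function name is hypothetical, and the finding string is compared loosely because I have seen it written both with and without an underscore.

```python
from typing import List, Tuple


def _normalise(finding: str) -> str:
    # Compare case-insensitively and ignore underscores, so both
    # "OVER_PROVISIONED" and "Overprovisioned" match.
    return finding.replace("_", "").lower()


def downsizing_candidates(response: dict) -> List[Tuple[str, str, str]]:
    """Return (instance ARN, current type, top-ranked suggested type) tuples."""
    candidates = []
    for rec in response.get("instanceRecommendations", []):
        if _normalise(rec.get("finding", "")) != "overprovisioned":
            continue
        options = rec.get("recommendationOptions", [])
        if not options:
            continue
        # Rank 1 is the top recommendation, so take the lowest rank.
        best = min(options, key=lambda o: o.get("rank", float("inf")))
        candidates.append(
            (rec["instanceArn"], rec["currentInstanceType"], best["instanceType"])
        )
    return candidates
```

Feed it the dictionary returned by `boto3.client("compute-optimizer").get_ec2_instance_recommendations()`. Treat the output as a list of instances to review, not as changes to apply blindly.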
Step 3: S3 lifecycle policies
S3 is cheap per GB but it accumulates silently. Logs, backups, and artifacts from 3 years ago sitting in Standard storage cost real money at scale.
# Lifecycle policy via Terraform
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "log-lifecycle"
    status = "Enabled"

    filter {} # empty filter = apply to all objects in the bucket

    transition {
      days          = 30
      storage_class = "STANDARD_IA" # ~45% cheaper per GB than Standard
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR" # ~68% cheaper per GB than Standard-IA
    }

    expiration {
      days = 365 # delete after 1 year
    }
  }
}
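To see where percentages like those in the policy comments come from, here is a small worked calculation. The per-GB prices are assumptions (they match the us-east-1 list prices I had in mind when writing this), so check the current S3 pricing page before relying on them.

```python
# Assumed storage prices in USD per GB-month. Not live pricing.
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
}


def monthly_cost(gigabytes: float, storage_class: str) -> float:
    """Storage-only monthly cost; ignores requests and retrieval fees."""
    return gigabytes * PRICE_PER_GB[storage_class]


def percent_cheaper(cheaper: str, baseline: str) -> int:
    """Whole-number percentage saved per GB by moving from baseline to cheaper."""
    return round(100 * (1 - PRICE_PER_GB[cheaper] / PRICE_PER_GB[baseline]))


for storage_class in PRICE_PER_GB:
    print(f"1 TB in {storage_class}: ${monthly_cost(1000, storage_class):.2f}/month")
```

This only covers storage. Standard-IA and Glacier Instant Retrieval also carry minimum storage durations and per-GB retrieval charges, so the real saving depends on how often the objects are read.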
Step 4: Kill idle NAT Gateways
NAT Gateways cost $0.045/hour plus data processing charges. An idle NAT Gateway in a dev environment costs ~$32/month. Multiply by 3 environments with 2 AZs each and you're paying $190/month for NAT Gateways in non-production environments doing almost nothing.
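The arithmetic behind those figures, as a quick sketch. The hourly rate is the one quoted above. The 720-hour month is my assumption, and data processing charges are left out.

```python
HOURLY_RATE_USD = 0.045  # hourly NAT Gateway charge quoted in the text
HOURS_PER_MONTH = 720    # assumption: 30-day month


def idle_nat_monthly_cost(environments: int, azs_per_environment: int,
                          hourly_rate: float = HOURLY_RATE_USD,
                          hours: int = HOURS_PER_MONTH) -> float:
    """Fixed monthly charge for NAT Gateways, excluding data processing."""
    gateways = environments * azs_per_environment
    return round(gateways * hourly_rate * hours, 2)


print(idle_nat_monthly_cost(1, 1))  # one gateway
print(idle_nat_monthly_cost(3, 2))  # three environments, two AZs each
```

With these inputs a single gateway comes to a little over $32 a month, and six come to a little over $190.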
# Find NAT Gateways with low traffic via CloudWatch
# (--period 2592000 is 30 days in seconds)
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-xxxxxxxxx \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-02-01T00:00:00Z \
  --period 2592000 \
  --statistics Sum
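For dev/staging environments that don't run overnight, a Lambda function that deletes NAT Gateways at 8pm and recreates them at 8am saves the hourly charge during off-hours. Each recreated gateway gets a new ID, so the morning run also has to re-point the private route tables at it.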
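The delete-and-recreate schedule could be sketched roughly like this. It is an illustration under several assumptions, not a drop-in implementation: `NAT_PLACEMENTS`, the `ManagedBy` tag, and the `{"action": ...}` event shape are all hypothetical, every ID is a placeholder, and the default route in each route table is assumed to already exist.

```python
from typing import Dict, List

# Hypothetical placement config: where each non-prod NAT Gateway lives and
# which private route tables send 0.0.0.0/0 through it. IDs are placeholders.
NAT_PLACEMENTS: List[Dict] = [
    {
        "subnet_id": "subnet-PLACEHOLDER",
        "allocation_id": "eipalloc-PLACEHOLDER",
        "route_table_ids": ["rtb-PLACEHOLDER"],
    },
]
MANAGED_TAG = {"Key": "ManagedBy", "Value": "nat-scheduler"}  # assumed tag


def delete_gateways(ec2) -> List[str]:
    """Delete every available NAT Gateway carrying the scheduler tag."""
    found = ec2.describe_nat_gateways(
        Filter=[
            {"Name": f"tag:{MANAGED_TAG['Key']}", "Values": [MANAGED_TAG["Value"]]},
            {"Name": "state", "Values": ["available"]},
        ]
    )
    ids = [gw["NatGatewayId"] for gw in found["NatGateways"]]
    for nat_id in ids:
        ec2.delete_nat_gateway(NatGatewayId=nat_id)
    return ids


def create_gateways(ec2, placements: List[Dict] = NAT_PLACEMENTS) -> List[str]:
    """Recreate gateways, then re-point default routes at the new IDs."""
    created = []
    for place in placements:
        nat_id = ec2.create_nat_gateway(
            SubnetId=place["subnet_id"],
            AllocationId=place["allocation_id"],
            TagSpecifications=[{"ResourceType": "natgateway", "Tags": [MANAGED_TAG]}],
        )["NatGateway"]["NatGatewayId"]
        # Wait until the gateway is available before re-pointing routes at it.
        ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
        for rtb_id in place["route_table_ids"]:
            ec2.replace_route(
                RouteTableId=rtb_id,
                DestinationCidrBlock="0.0.0.0/0",
                NatGatewayId=nat_id,
            )
        created.append(nat_id)
    return created


def handler(event, context):
    import boto3  # imported lazily so the helpers can be exercised with a stub

    ec2 = boto3.client("ec2")
    if event.get("action") == "delete":
        return {"deleted": delete_gateways(ec2)}
    return {"created": create_gateways(ec2)}
```

Two things to watch if you try this: the waiter can take a few minutes, so the Lambda timeout needs to allow for it, and the Elastic IP stays allocated (and billed) while the gateway is gone. Pagination of `describe_nat_gateways` is skipped here for brevity.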
Step 5: Reserved Instances and Savings Plans
If you've been running the same instance types for 6+ months and plan to continue — buy Reserved Instances or Compute Savings Plans. This alone typically saves 30–40% on EC2 costs with zero configuration changes.
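The "plan to continue" condition matters because a commitment is paid whether or not you use it. A rough way to reason about the break-even point, assuming a flat fractional discount (the function names and the simplified model are mine, and real Savings Plans pricing has more moving parts):

```python
def commitment_outcome(on_demand_monthly: float, discount: float,
                       utilisation: float) -> float:
    """Net monthly saving versus on-demand; a negative result is a loss.

    on_demand_monthly: on-demand cost of the committed capacity at full use
    discount: fractional discount of the commitment, e.g. 0.35 for 35%
    utilisation: fraction of the committed capacity actually used (0 to 1)
    """
    committed_cost = on_demand_monthly * (1 - discount)     # paid regardless
    on_demand_equivalent = on_demand_monthly * utilisation  # what usage was worth
    return round(on_demand_equivalent - committed_cost, 2)


def break_even_utilisation(discount: float) -> float:
    """Utilisation below which the commitment costs more than on-demand."""
    return 1 - discount


print(commitment_outcome(1000, 0.35, 1.0))  # fully used commitment
print(commitment_outcome(1000, 0.35, 0.5))  # half-used commitment
```

In this model a 35% discount only pays off if you use more than about 65% of what you committed to, which is why stable, long-running workloads are the safe place to start.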
Step 6: Automate with Lambda
# Lambda to stop non-prod instances on a schedule
import boto3


def handler(event, context):
    ec2 = boto3.client('ec2')

    # Find running instances tagged Environment=dev or Environment=staging.
    # describe_instances is paginated, so walk every page.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = [
        i['InstanceId']
        for page in pages
        for r in page['Reservations']
        for i in r['Instances']
    ]

    # Only call stop_instances when something matched
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {'stopped': instance_ids}
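Stopping is only half of a schedule, since something has to bring the instances back in the morning. One way to cover both directions with a single function is sketched below. The `{"action": "start"}` / `{"action": "stop"}` event shape is an assumption: it would be set as the constant input on two scheduled rules, one for the evening and one for the morning.

```python
NON_PROD = ["dev", "staging"]

ACTIONS = {
    # action -> (instance state to look for, EC2 client method to call)
    "stop": ("running", "stop_instances"),
    "start": ("stopped", "start_instances"),
}


def matching_instance_ids(ec2, state: str) -> list:
    """IDs of non-prod instances currently in the given state."""
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": NON_PROD},
            {"Name": "instance-state-name", "Values": [state]},
        ]
    )
    return [
        i["InstanceId"]
        for page in pages
        for r in page["Reservations"]
        for i in r["Instances"]
    ]


def run(ec2, action: str) -> dict:
    """Start or stop every matching instance and report what was touched."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    state, method = ACTIONS[action]
    ids = matching_instance_ids(ec2, state)
    if ids:
        getattr(ec2, method)(InstanceIds=ids)
    return {action: ids}


def handler(event, context):
    import boto3  # imported lazily so run() can be tested with a stub client

    return run(boto3.client("ec2"), event["action"])
```

One caveat with this approach: the morning run starts every stopped dev or staging instance, including ones someone stopped on purpose. An extra opt-in tag is a simple way to narrow that down.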
The order that works
- Tag everything — get visibility first
- Right-size with Compute Optimizer recommendations
- Add S3 lifecycle policies to all log and backup buckets
- Audit NAT Gateways in non-production environments
- Schedule non-prod instances off overnight
- Buy Savings Plans for stable production workloads
In that order, with 6 weeks of work, this is how you get to a 20% reduction without touching a single production workload.