Cloud & Infrastructure · 9 min read

Cloud cost overruns: the four decisions that cause 80% of them

Published 2026-04-08

Cloud budgets rarely blow up because of any single decision. They blow up because of four compounding choices, and in most mid-market IT organizations nobody is accountable for any of them.

Heartwood has looked inside the cloud bills of more than forty mid-market companies in the past five years. The pattern is uncannily consistent: the bill doubles over eighteen months, nobody can point to the event that caused it, and the conversation about "getting cloud costs under control" starts about three months after the CFO first sees the number.

What surprises people is how narrow the real problem is. In almost every case, the overrun traces back to four categories. Here they are, in the order they usually fail.

1. Storage tiering: the quietest and largest

The first cloud decision almost nobody revisits after launch: what tier does your data live in? S3 Standard (or its equivalents in Azure and GCP) is priced for hot data, the stuff you read often. But most of what mid-market companies store is not hot. Application logs older than 30 days. Customer records from closed accounts. Database backups nobody has ever restored. Completed export files. Processed source data.

The price difference between S3 Standard and S3 Glacier Deep Archive is approximately 23x on the same volume. A company with 400 TB of effectively cold data sitting in Standard is paying about $9,200 a month. In Deep Archive, that same data costs about $400. Every mid-market company we've audited has somewhere between 100 TB and 2 PB of data in the wrong tier.
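The arithmetic behind those figures can be reproduced with flat per-GB rates. This is a rough sketch: the two prices are assumed list prices, not numbers from our audits, and the function ignores volume tiers, request fees, retrieval fees, and minimum storage durations. Check your provider's current price sheet before relying on it.

```python
# Assumed list prices in USD per GB-month. These are illustrative assumptions;
# real pricing varies by region and volume tier.
STANDARD_PER_GB = 0.023
DEEP_ARCHIVE_PER_GB = 0.00099
GB_PER_TB = 1000  # decimal terabytes, also an assumption


def monthly_storage_cost(terabytes: float, price_per_gb: float) -> float:
    """Flat-rate monthly storage cost, ignoring request and retrieval fees."""
    return terabytes * GB_PER_TB * price_per_gb


cold_tb = 400
standard = monthly_storage_cost(cold_tb, STANDARD_PER_GB)
deep_archive = monthly_storage_cost(cold_tb, DEEP_ARCHIVE_PER_GB)

print(f"Standard:     ${standard:,.0f}/month")
print(f"Deep Archive: ${deep_archive:,.0f}/month")
print(f"Ratio:        {standard / deep_archive:.1f}x")
```

At these assumed rates the gap is a little over 23x. Tiered volume pricing would move the ratio slightly.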

The fix is not complicated: lifecycle policies, a retention audit, and a one-time bulk transition. But it requires someone to be responsible for it, and in most orgs nobody is. Storage is "on." It just keeps working, so it never makes it onto a roadmap.

2. Commitment strategy: the one that requires guessing

The second most expensive decision is your use of Reserved Instances, Savings Plans, or equivalent commitment discounts. Done well, these save 30 to 55% off on-demand pricing for predictable workloads. Done poorly, they either lock you into capacity you're not using or leave money on the table for workloads that never change.

The common failure modes: committing 100% of current usage the week before a product launch that shifts the workload shape. Using 1-year Savings Plans when the workload has been stable for four years and a longer term would have earned a deeper discount. Paying no attention at all and running a $40K/month bill entirely on-demand because nobody wanted to make the call.

The right rhythm is a quarterly FinOps review, even if it's just one engineer for two hours. Look at the last 90 days of usage, model commitment scenarios against the next 90, and adjust. Teams that do this save 20 to 35% off their compute bill without any engineering work. Teams that don't, don't.
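Modeling commitment scenarios does not need special tooling. The sketch below is one simplified way to do it, not a description of any provider's billing engine: it assumes a single on-demand rate, a single discount, and whole-unit hourly usage. The names `blended_cost` and `best_commitment`, the 40% discount, and the sample usage are all hypothetical.

```python
def blended_cost(hourly_usage, commit_units, on_demand_rate, discount):
    """Total cost when `commit_units` per hour are pre-paid at a discount.

    Committed capacity is paid for every hour whether it is used or not.
    Usage above the commitment is billed at the on-demand rate.
    """
    committed_rate = on_demand_rate * (1 - discount)
    total = 0.0
    for used in hourly_usage:
        overflow = max(0, used - commit_units)
        total += commit_units * committed_rate + overflow * on_demand_rate
    return total


def best_commitment(hourly_usage, on_demand_rate, discount):
    """Try every whole-unit commitment from 0 up to peak usage; return the cheapest."""
    candidates = range(0, max(hourly_usage) + 1)
    return min(
        candidates,
        key=lambda c: blended_cost(hourly_usage, c, on_demand_rate, discount),
    )


# Hypothetical workload: a steady base of 10 units with short peaks of 30.
usage = [10, 10, 10, 10, 10, 10, 30, 30] * 270
rate, discount = 1.0, 0.40  # assumed on-demand price per unit-hour and discount

all_on_demand = blended_cost(usage, 0, rate, discount)
commit_to_peak = blended_cost(usage, max(usage), rate, discount)
best = best_commitment(usage, rate, discount)

print("best commitment:", best)
print("savings vs on-demand:", 1 - blended_cost(usage, best, rate, discount) / all_on_demand)
print("committing to peak costs more than on-demand:", commit_to_peak > all_on_demand)
```

In this toy workload, committing to the steady base beats both extremes, and committing to peak is worse than doing nothing. That is the shape of the over-commitment failure described above.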

3. Data transfer: the one that surprises you at billing time

Data transfer (sometimes called "egress") is the charge nobody budgets for because it's invisible in the architecture diagram. Any time data crosses an availability zone boundary, a region boundary, or the edge to the public internet, the cloud provider meters it. In a single-region, single-AZ architecture, egress is usually trivial. Once you go multi-region, or add cross-AZ redundancy, or serve content from a compute region to customers elsewhere, it adds up fast.

We audited a B2B SaaS company last year whose egress bill had grown to $31,000 a month, about 22% of their total cloud spend. The cause was a logging pipeline that had been quietly shipping every request payload from their primary region to a centralized log store in a second region, for "observability." Nobody noticed for eleven months.

The fix for most egress problems is to co-locate services that talk to each other. The prevention is a monthly egress line-item review. At 5% of total spend or less, ignore. Between 5% and 15%, investigate. Above 15%, something is wrong.
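The monthly check is simple enough to script. A minimal sketch of the thresholds above; the function name and the verdict strings are our own invention, and the inputs are whatever your billing export reports for data transfer and total spend.

```python
def egress_verdict(egress_spend: float, total_spend: float) -> str:
    """Classify egress as a share of total spend using the 5% and 15% thresholds."""
    if total_spend <= 0:
        raise ValueError("total_spend must be positive")
    share = egress_spend / total_spend
    if share <= 0.05:
        return "ignore"
    if share <= 0.15:
        return "investigate"
    return "something is wrong"


# A $31,000 egress bill at roughly 22% of spend implies a total near $141,000.
print(egress_verdict(31_000, 141_000))
print(egress_verdict(4_000, 141_000))
```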

4. Environments nobody shut down

The smallest and most embarrassing category: dev, staging, demo, and test environments that keep running forever. The typical story: an engineer spun up a larger-than-production RDS instance to reproduce a customer bug, forgot about it, and later left the company. Fourteen months on, that instance is still running at $340/month. Multiply by every such forgotten resource and you're easily at $5,000 to $15,000/month of pure waste.

The operational answer: aggressive automation on non-prod environments. Auto-shutdown dev instances outside business hours. Tag every resource with a team and a TTL. Monthly orphaned-resource reports that get actioned. None of this is technically hard. All of it requires someone whose job it is.
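An orphaned-resource report can start as a few lines run against a resource inventory. The sketch below assumes you have already exported resources as dictionaries with an `id` and a `tags` mapping. The tag keys `team` and `ttl`, the ISO date format for the TTL, and the sample inventory are all hypothetical conventions, not anything a cloud provider enforces.

```python
from datetime import date

REQUIRED_TAGS = ("team", "ttl")  # hypothetical tag keys; use your own convention


def flag_resources(resources, today):
    """Return (resource_id, reason) pairs for resources that are untagged or past TTL."""
    flagged = []
    for res in resources:
        tags = res.get("tags", {})
        missing = [key for key in REQUIRED_TAGS if key not in tags]
        if missing:
            flagged.append((res["id"], "missing tags: " + ", ".join(missing)))
            continue
        # TTL is assumed to be an ISO date string such as "2026-03-01".
        if date.fromisoformat(tags["ttl"]) < today:
            flagged.append((res["id"], "expired on " + tags["ttl"]))
    return flagged


# Hypothetical inventory for illustration.
inventory = [
    {"id": "db-repro-bug", "tags": {}},
    {"id": "demo-env", "tags": {"team": "sales", "ttl": "2026-01-31"}},
    {"id": "staging-api", "tags": {"team": "platform", "ttl": "2026-12-31"}},
]

for resource_id, reason in flag_resources(inventory, today=date(2026, 4, 8)):
    print(resource_id, "->", reason)
```

The report only matters if someone owns the follow-up: each flagged resource needs a person who either renews the TTL or deletes the resource.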

The accountability gap

The common thread across all four categories is ownership. Cloud infrastructure is provisioned by engineers, billed to finance, and governed by no one. The engineer who spins up a resource has no line of sight to the bill. The finance team sees the bill but cannot read the invoice at the level of "this resource, this cost, this owner." The CIO or CFO sees a total that went up.

Companies that control their cloud bill have at least one of three things: a designated FinOps owner (full- or part-time), automated tagging with strict enforcement, or a quarterly spend review that includes both engineering and finance. That review focuses on the four categories above, not on the total bill, which is lagging information.

Where to start

If you're reading this because your bill has quietly doubled, run the Pareto audit: pull your last 90 days of spend, sort by service, and look at the top five line items. In nine out of ten mid-market audits, one of those five falls into one of the four categories above, and it usually accounts for 60 to 80% of the overrun rather than a little bit spread across everything.
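The Pareto audit is a group-and-sort over a billing export. A minimal sketch, assuming the export has been reduced to (service, cost) pairs for the last 90 days; the function name and the sample figures are made up for illustration.

```python
from collections import defaultdict


def top_services(line_items, n=5):
    """Sum cost per service and return the top n as (service, cost, share_of_total)."""
    totals = defaultdict(float)
    for service, cost in line_items:
        totals[service] += cost
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
    return [(service, cost, cost / grand_total) for service, cost in ranked]


# Hypothetical 90-day spend, one row per service per month.
spend = [
    ("storage", 30_000), ("storage", 32_000), ("storage", 34_000),
    ("compute", 40_000), ("compute", 41_000), ("compute", 39_000),
    ("data transfer", 9_000), ("data transfer", 12_000), ("data transfer", 15_000),
    ("managed database", 8_000), ("managed database", 8_000), ("managed database", 8_000),
    ("monitoring", 2_000), ("monitoring", 2_000), ("monitoring", 2_000),
    ("dns", 100), ("dns", 100), ("dns", 100),
]

for service, cost, share in top_services(spend):
    print(f"{service:18s} ${cost:>10,.0f}  {share:6.1%}")
```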

The first $25,000 a year that a mid-market company spends on cloud cost discipline almost always returns $150,000 to $400,000. It is the most underrated IT investment we know of, and it is invisible to the business until someone decides it matters.

Common questions

What CFOs and CTOs ask us about cloud spend

We're already on Reserved Instances. Where else should I look first?

Storage tiering, almost certainly. RIs and Savings Plans get the most attention because the discount is named on the invoice, but the largest single category of waste in mid-market cloud bills is hot storage that should be cold. Pull a report of your S3 (or equivalent) usage by storage class and look at the age of objects in Standard. If more than 30% of your Standard volume is older than 90 days and hasn't been read in 60, you have a tiering problem worth tens of thousands of dollars a year. After tiering, look at egress as a percentage of total spend. Above 15%, something is misconfigured. Compute right-sizing comes third in our experience, not first.
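The 30% test above is a straightforward filter once you have per-object age and last-read figures. This sketch assumes you have already assembled that data for objects in the Standard class, for example from an inventory report joined with access logs; how you obtain last-read times is left open. The function name and the tuple layout are hypothetical.

```python
def has_tiering_problem(objects, age_days=90, idle_days=60, threshold=0.30):
    """Check whether cold data exceeds `threshold` of Standard-class volume.

    `objects` holds (size_bytes, age_in_days, days_since_last_read) tuples.
    An object counts as cold if it is older than `age_days` and has not been
    read in `idle_days`.
    """
    objects = list(objects)
    total = sum(size for size, _, _ in objects)
    if total == 0:
        return False
    cold = sum(
        size for size, age, idle in objects
        if age > age_days and idle > idle_days
    )
    return cold / total > threshold


# Hypothetical sample: 40 units of cold data out of 100.
sample = [
    (40, 400, 380),  # old and unread: cold
    (35, 10, 1),     # recent and hot
    (25, 200, 5),    # old but still read regularly
]
print(has_tiering_problem(sample))
```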

How aggressive can I get with storage tiering before we feel it?

More aggressive than you think, with one rule: never tier without lifecycle policies and a documented plan for retrieval. Application logs older than 30 days can almost always go to Infrequent Access. Logs older than 90 days can go to Glacier or Deep Archive. Database backups follow the same pattern, with the caveat that your DR runbook needs to account for the retrieval time. The mistake we see is teams tiering manually, then getting paged at 2am because a customer support query needs three hours to thaw a Glacier object. Set up the lifecycle policies, document the retrieval SLAs, and tell the team. Done well, you'll cut storage spend 40% to 60% with zero operational impact.
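For S3, the log schedule above maps onto a lifecycle configuration with two transitions. The sketch below builds that configuration as a plain Python dictionary in the shape the S3 API expects. The rule ID, the `logs/` prefix, and the bucket name in the comment are placeholders, and the day counts should be set from your own retention audit.

```python
import json

# Lifecycle configuration for application logs. ID and prefix are placeholders.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-application-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},   # Infrequent Access
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},  # Deep Archive
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-log-bucket",  # placeholder bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

Keeping the policy in code or version-controlled config, rather than clicking it together in a console, also gives the team one place to look up which data lands in which tier and when.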

What's the right cadence for a cloud cost review at our size?

Quarterly is the floor. Monthly is the ceiling. For most mid-market companies (50 to 500 employees, $200K to $2M monthly cloud spend), quarterly works because the underlying workload doesn't shift fast enough to need more frequent reviews. The review needs an engineer with operational context (not just a finance analyst), 90 minutes of focused time, and a fixed agenda: top 10 services by spend, change versus prior quarter, anomaly investigations, commitment coverage, and one or two opportunistic optimizations. Skip the cadence and you'll learn what you missed when the bill doubles. Run it religiously and you'll catch the next $20K/month leak within one quarter, not in month eleven.

Signed by the Heartwood team at Seven Roots Consulting.
