Cloud Ops · Cost Optimization · Dashboards

Operate & Optimize

An operations toolkit that cut manual toil, surfaced wasteful infrastructure, and gave leadership a single pane of glass for cloud health and spend.

From spreadsheets to signals

Before this work, cost and reliability insights lived in ad-hoc scripts and one-off reports. Afterward, teams had a stable set of dashboards and automation they could build on.

Role

SRE / Platform Engineer · Cost & Reliability

Tech Stack

AWS, Prometheus/Grafana or CloudWatch, Cost Explorer, scripting (PowerShell, Python), GitHub Actions

Highlights

Weekly cost & reliability reviews · Automated clean-up jobs · Shared dashboards for product + engineering

Overview

The platform had grown quickly. So had the bill and the number of dashboards. Teams lacked a shared view of what was healthy and what was waste.

We set out to design a small, opinionated set of tools and views: enough to drive the right conversations, not a forest of charts.

Toolkit architecture

We grouped the work into three pillars:

Cost: tags, budgets, and scheduled reports filtered by owner and environment.
Reliability: SLO-style dashboards and alerting tuned to user experience, not just CPU.
Toil reduction: small automation jobs that stop repetitive tasks from stealing engineer time.
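As an illustration of the cost pillar, a scheduled report can be reduced to a rollup of spend by owner and environment tags. The sketch below is a minimal example, not the original tooling: the record layout, the field names (`owner`, `env`, `cost_usd`), and the sample figures are all assumptions, and in practice the line items would more likely come from a Cost Explorer export than from a hard-coded list.

```python
from collections import defaultdict

# Hypothetical cost line items. Field names and values are illustrative
# assumptions; real data would come from a billing or Cost Explorer export.
LINE_ITEMS = [
    {"resource": "i-0a1", "owner": "payments", "env": "prod", "cost_usd": 412.50},
    {"resource": "i-0b2", "owner": "payments", "env": "dev", "cost_usd": 88.25},
    {"resource": "db-7", "owner": "search", "env": "prod", "cost_usd": 230.00},
    {"resource": "vol-9", "owner": None, "env": None, "cost_usd": 19.75},
]


def rollup_by_owner_env(items):
    """Sum cost per (owner, env) pair; missing tags fall into 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        owner = item.get("owner") or "untagged"
        env = item.get("env") or "untagged"
        totals[(owner, env)] += item["cost_usd"]
    return dict(totals)


def format_report(totals):
    """Render one line per (owner, env) pair, most expensive first."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(
        f"{owner:<10} {env:<9} ${cost:,.2f}" for (owner, env), cost in ranked
    )


if __name__ == "__main__":
    print(format_report(rollup_by_owner_env(LINE_ITEMS)))
```

Keeping an explicit "untagged" bucket is a deliberate choice in this sketch: spend with no owner shows up in the report, so it can be discussed in a review.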

Sample optimization job

One simple but high-impact task was a nightly script that identified idle resources and opened tickets automatically. Conceptually, it looked like this:

Example clean-up flow

1. Pull EC2 instances, RDS databases, and storage volumes that look idle (low CPU, no recent connections)
2. Cross-check with tagging to find owner and environment
3. Create a ticket or chat message with a one-click "approve shutdown" link
4. After a grace period, stop or downsize resources automatically
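The four steps above can be sketched in plain Python. This is a simplified model under stated assumptions, not the original script: the `Resource` record, the 5% CPU threshold, the 7-day grace period, and the tag keys `owner` and `environment` are all hypothetical. The rule that an approved ticket can be actioned before the grace period ends is also an assumption. Calls to the cloud provider and the ticketing system are left out so that the decision logic stays runnable on its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Assumed thresholds; the original job's values are not stated.
CPU_THRESHOLD_PCT = 5.0
GRACE_PERIOD = timedelta(days=7)


@dataclass
class Resource:
    resource_id: str
    kind: str                 # e.g. "ec2" or "rds"
    avg_cpu_pct: float        # average CPU over the lookback window
    recent_connections: int   # connections seen in the lookback window
    tags: dict = field(default_factory=dict)


def is_idle(resource):
    """Step 1: low CPU and no recent connections."""
    return (resource.avg_cpu_pct < CPU_THRESHOLD_PCT
            and resource.recent_connections == 0)


def build_ticket(resource, now):
    """Steps 2 and 3: resolve owner and environment from tags, draft a ticket."""
    return {
        "resource_id": resource.resource_id,
        "owner": resource.tags.get("owner", "unowned"),
        "environment": resource.tags.get("environment", "unknown"),
        "opened_at": now,
        "approved": False,  # set to True when the owner approves shutdown
        "summary": (f"{resource.kind} {resource.resource_id} looks idle "
                    f"({resource.avg_cpu_pct:.1f}% CPU, no connections)"),
    }


def open_tickets(inventory, now):
    """Draft one ticket per idle resource in the inventory."""
    return [build_ticket(r, now) for r in inventory if is_idle(r)]


def next_action(ticket, now):
    """Step 4: act once approved or once the grace period has elapsed."""
    if ticket["approved"] or now - ticket["opened_at"] >= GRACE_PERIOD:
        return "stop"
    return "wait"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    inventory = [
        Resource("i-idle", "ec2", 1.2, 0, {"owner": "search", "environment": "dev"}),
        Resource("db-busy", "rds", 40.0, 12, {"owner": "payments"}),
    ]
    for ticket in open_tickets(inventory, now):
        print(ticket["owner"], ticket["summary"],
              next_action(ticket, now + GRACE_PERIOD))
```

Separating detection (`is_idle`) from the action decision (`next_action`) keeps each rule easy to test and tune on its own.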

Impact

Within a few cycles, the organization had a regular “Operate & Optimize” rhythm: a short weekly review grounded in the same dashboards and automated summaries.

Toil dropped, surprise bills shrank, and teams could clearly explain how platform changes affected both reliability and spend.