Operate & Optimize
An operations toolkit that cut manual toil, surfaced wasteful infrastructure, and gave leadership a single pane of glass for cloud health and spend.
From spreadsheets to signals
Before this work, cost and reliability insights lived in ad-hoc scripts and one-off reports. Afterward, teams had a stable set of dashboards and automation they could build on.
Role
SRE / Platform Engineer · Cost & Reliability
Tech Stack
AWS, Prometheus/Grafana or CloudWatch, Cost Explorer, scripting (PowerShell, Python), GitHub Actions
Highlights
Weekly cost & reliability reviews · Automated clean-up jobs · Shared dashboards for product + engineering
Overview
The platform had grown quickly. So had the bill and the number of dashboards. Teams lacked a shared view of what was healthy and what was waste.
We set out to design a small, opinionated set of tools and views: enough to drive the right conversations, not a forest of charts.
Toolkit architecture
We grouped the work into three pillars:
- Cost: tags, budgets, and scheduled reports filtered by owner and environment.
- Reliability: SLO-style dashboards and alerting tuned to user experience, not just CPU.
- Toil reduction: small automation jobs to stop repetitive tasks stealing engineer time.
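To make the cost pillar concrete, here is a minimal sketch of a report grouped by owner and environment tags. It is an illustration under stated assumptions, not the toolkit's actual code: the `CostItem` fields, the `owner` and `environment` tag keys, and the `untagged` bucket are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostItem:
    """One billing line item. Field names are illustrative, not a real export schema."""
    resource_id: str
    monthly_cost: float
    tags: dict = field(default_factory=dict)


def cost_by_owner_env(items):
    """Sum cost per (owner, environment); items missing a tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in items:
        key = (
            item.tags.get("owner", "untagged"),
            item.tags.get("environment", "untagged"),
        )
        totals[key] += item.monthly_cost
    return dict(totals)


# Example data (made up for illustration only).
items = [
    CostItem("i-aaa", 100.0, {"owner": "payments", "environment": "prod"}),
    CostItem("i-bbb", 50.0, {"owner": "payments", "environment": "prod"}),
    CostItem("vol-ccc", 20.0, {}),  # no tags: surfaces as untagged spend
]
report = cost_by_owner_env(items)
```

Keeping untagged spend in its own bucket, rather than dropping it, makes gaps in tagging visible in the same report that owners already read.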
Sample optimization job
One simple but high-impact task was a nightly script that identified idle resources and opened tickets automatically. Conceptually, it looked like this:
1. Pull EC2 / RDS / volumes with low CPU + no recent connections
2. Cross-check with tagging to find owner and environment
3. Create a ticket or chat message with a one-click "approve shutdown" link
4. After grace period, stop or downsize resources automatically
Impact
Within a few cycles, the organisation had a regular “Operate & Optimize” rhythm: a short weekly review grounded in the same dashboards and automated summaries.
Toil dropped, surprise bills shrank, and teams could clearly explain how platform changes affected both reliability and spend.
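For readers who want the nightly idle-resource job from "Sample optimization job" in more concrete terms, the four steps might be sketched roughly as follows. This is a simplified illustration, not the production script: the `Resource` fields, the 5% CPU threshold, the 7-day grace period, and the ticket payload shape are all assumptions, and the cloud and ticketing API calls are deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

CPU_IDLE_THRESHOLD = 5.0          # percent; assumed cut-off, not from the original job
GRACE_PERIOD = timedelta(days=7)  # assumed grace period


@dataclass
class Resource:
    """Hypothetical inventory record, already enriched with metrics and tags."""
    resource_id: str
    kind: str                     # e.g. "ec2", "rds", "volume"
    avg_cpu_percent: float
    recent_connections: int
    tags: dict = field(default_factory=dict)


def find_idle(resources):
    """Step 1: keep resources with low CPU and no recent connections."""
    return [
        r for r in resources
        if r.avg_cpu_percent < CPU_IDLE_THRESHOLD and r.recent_connections == 0
    ]


def build_ticket(resource):
    """Steps 2-3: look up owner/environment from tags and draft a ticket payload."""
    return {
        "resource_id": resource.resource_id,
        "owner": resource.tags.get("owner", "unknown"),
        "environment": resource.tags.get("environment", "unknown"),
        "action": "approve shutdown",
    }


def grace_expired(flagged_at, now):
    """Step 4: True once the grace period has passed since the ticket was opened."""
    return now - flagged_at >= GRACE_PERIOD


# Example data (made up for illustration only).
fleet = [
    Resource("i-1", "ec2", 1.2, 0, {"owner": "search", "environment": "dev"}),
    Resource("i-2", "ec2", 40.0, 12, {"owner": "search", "environment": "prod"}),
]
idle = find_idle(fleet)
tickets = [build_ticket(r) for r in idle]
```

Separating detection, ticket drafting, and the grace-period check keeps each step testable without cloud credentials; a real job would wrap these with calls to the metrics, tagging, and ticketing systems in use.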
