Security Strategy

Nick Shevelyov
CEO & managing partner
Published Jan 8, 2026
In December 2025, an inquiry into an Optus emergency-call outage tied the incident to failures during a firewall upgrade and highlighted gaps in procedures, escalation, and crisis response—plus accountability consequences at the board level. (Reuters) When a disruption impacts safety, customers, or market trust, it becomes a governance event.
At the same time, the innovation economy runs on concentrated infrastructure. The October 2025 AWS US‑EAST‑1 disruption—rooted in DNS resolution issues tied to regional DynamoDB endpoints—reminded leaders that a single control-plane failure can cascade across thousands of dependent services. (About Amazon)
The board takeaway is not “avoid change.” It’s “treat change like risk.”
Why change management is now a board topic
Most major incidents aren't a mystery. They follow a predictable pattern:
A high-risk change lands in production without sufficient guardrails.
Signals get missed or ignored (monitoring, alerts, early warnings).
Escalations are slow because roles are unclear.
Communications lag because leadership doesn’t have a shared fact base.
That’s why outage governance is fundamentally about decision velocity: how quickly your organization can understand what is happening, decide what to do, and communicate with credibility.
The Change Risk Gate: a practical control that doesn’t slow teams down
You don’t need bureaucracy. You need a simple “risk gate” that triggers when blast radius is high.
Use this as your baseline (adapt to your environment):
Classify the change. Is it customer-facing? Does it touch identity, network routing, payments, or emergency/critical paths?
Define blast radius. What breaks if it fails? What’s the maximum plausible impact window?
Prove rollback. If you can’t roll back in minutes, you’re not ready to deploy.
Confirm rerouting and failover. If the primary path fails, what’s the alternate path and has it been tested recently?
Validate third-party responsibilities. If a contractor or vendor executes the change, who owns the decision, and what evidence will you receive?
Pre-stage observability. What specific metrics/logs confirm success? What thresholds trigger an automatic pause?
Run the “two-call test.” If this goes sideways, who gets paged first and second—and who has authority to stop the change?
This is not theoretical. Reviews of real-world failures repeatedly point to weak controls, missed warnings, and unclear escalation—exactly what a risk gate is designed to fix. (Reuters)
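To make the gate concrete, here is a minimal sketch in Python of what an automated pre-deployment check might look like. Everything in it is illustrative, not a standard or a product API: the ChangeRequest fields, the 15-minute rollback bar, and the 90-day failover-test window are assumptions you would tune to your own environment.

```python
# Hypothetical sketch of a change risk gate mirroring the checklist above.
# All field names and thresholds are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class ChangeRequest:
    description: str
    customer_facing: bool            # Does the change touch customer-visible paths?
    touches_critical_path: bool      # Identity, routing, payments, emergency/critical paths
    rollback_minutes: int            # Demonstrated time to roll back, in minutes
    failover_last_tested: datetime   # When the alternate path was last exercised
    success_metrics: list = field(default_factory=list)   # Signals that confirm success
    pause_thresholds: dict = field(default_factory=dict)  # Metric -> value that auto-pauses
    first_pager: str = ""            # Who gets the first call if it goes sideways
    stop_authority: str = ""         # Who is authorized to halt the change

def risk_gate(change: ChangeRequest, now: datetime) -> list:
    """Return a list of blocking findings; an empty list means the gate passes."""
    findings = []
    high_blast_radius = change.customer_facing or change.touches_critical_path

    if high_blast_radius and change.rollback_minutes > 15:
        findings.append("Rollback not proven within minutes; not ready to deploy.")
    if high_blast_radius and now - change.failover_last_tested > timedelta(days=90):
        findings.append("Failover path not tested in the last 90 days.")
    if not change.success_metrics or not change.pause_thresholds:
        findings.append("Observability not pre-staged: no success metrics or pause thresholds.")
    if not change.first_pager or not change.stop_authority:
        findings.append("Two-call test fails: unclear who gets paged or who can stop the change.")
    return findings

# Example: a vendor-executed firewall upgrade on a critical routing path.
upgrade = ChangeRequest(
    description="Firewall upgrade on emergency-call routing",
    customer_facing=True,
    touches_critical_path=True,
    rollback_minutes=45,
    failover_last_tested=datetime(2025, 5, 1),
    success_metrics=["call_completion_rate"],
    pause_thresholds={"call_completion_rate": 0.99},
    first_pager="network-oncall",
    stop_authority="incident-commander",
)
for finding in risk_gate(upgrade, now=datetime(2026, 1, 8)):
    print("BLOCKED:", finding)
```

In practice these checks usually live inside a CI/CD pipeline or change-ticket workflow rather than a standalone script; the point is that every item on the checklist becomes an explicit, testable condition rather than a judgment call made at 2 a.m.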
Don’t separate outage readiness from crisis communications
The best comms teams can’t compensate for missing facts. Your incident communications plan must be attached to your operational plan.
A board-ready comms approach is simple and disciplined:
First update (15–30 minutes): What’s impacted, what’s not, what you’re doing next, and when the next update will land.
Stabilization update: What you’ve changed to stop further harm (pause changes, fail over, isolate systems).
Customer update: Plain language, scoped impact, steps customers should take (if any), and how you’ll support them.
Post-incident memo (48–72 hours): Root cause (technical + process), what is now different, and what will be measured.
When a vendor is involved, the message should remain consistent: customers should not have to decode your supply chain. Your organization owns the customer relationship.
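One way to keep the comms plan attached to the operational plan is to pre-stage the cadence as data the incident commander can track. The sketch below is a hypothetical Python example: the first-update deadline follows the 30-minute outer bound above, while the stabilization and customer-update offsets are assumptions to adapt to your own severity levels and approval paths.

```python
# Hypothetical sketch of a pre-staged comms cadence; labels follow the article,
# but timings beyond the first update are illustrative assumptions.
from datetime import datetime, timedelta

COMMS_CADENCE = [
    # (label, due after incident start, what the update must contain)
    ("First update",         timedelta(minutes=30), "Impacted / not impacted, next actions, time of next update"),
    ("Stabilization update", timedelta(hours=2),    "What changed to stop further harm (pause, fail over, isolate)"),
    ("Customer update",      timedelta(hours=4),    "Plain-language scope, customer actions, support channels"),
    ("Post-incident memo",   timedelta(hours=72),   "Root cause (technical + process), what is different, what will be measured"),
]

def comms_schedule(incident_start: datetime) -> list:
    """Turn the cadence into concrete deadlines the incident commander can track."""
    return [(label, incident_start + offset, content) for label, offset, content in COMMS_CADENCE]

for label, due, content in comms_schedule(datetime(2026, 1, 8, 9, 0)):
    print(f"{label}: due {due:%Y-%m-%d %H:%M} -> {content}")
```

Pre-writing the content requirements alongside the deadlines is what lets the comms team draft in parallel with the technical response instead of waiting for it.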
Board questions that separate mature operators from risky ones
If you only add five questions to your next board agenda, use these:
What are our top five “high-blast-radius” change categories?
How often do we test rollback and failover in production-like conditions?
What is our mean time to detect and mean time to stabilize for priority incidents?
Which third parties can take us down, and what obligations do they have during an incident?
Do we have a standing incident governance team that can make decisions fast (including comms)?
What to do this quarter
If you want fewer surprises without slowing delivery:
Implement a change risk gate for the top 3–5 change types.
Pre-write incident comms templates and approval paths.
Run one executive tabletop focused on “change gone wrong” (not ransomware).
Agree on two metrics the board will track quarterly (e.g., rollback success rate; time to stabilize).
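The two example metrics are easy to compute once change and incident records carry the right timestamps. The sketch below is a hypothetical Python illustration with made-up records; in practice your change-management and incident tooling would supply the real data.

```python
# Hypothetical sketch of the two quarterly board metrics; the records are made up
# and the field layout is an assumption, not a tooling standard.
from datetime import datetime
from statistics import median

change_records = [
    # (rollback attempted, rollback succeeded)
    (True, True), (True, False), (True, True), (True, True),
]
incident_records = [
    # (detected, stabilized)
    (datetime(2026, 1, 3, 7, 10), datetime(2026, 1, 3, 9, 40)),
    (datetime(2026, 2, 14, 14, 5), datetime(2026, 2, 14, 15, 35)),
]

attempted = [r for r in change_records if r[0]]
rollback_success_rate = sum(1 for _, ok in attempted if ok) / len(attempted)

time_to_stabilize_minutes = median(
    (stabilized - detected).total_seconds() / 60 for detected, stabilized in incident_records
)

print(f"Rollback success rate: {rollback_success_rate:.0%}")
print(f"Median time to stabilize: {time_to_stabilize_minutes:.0f} minutes")
```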
If your board is asking whether you’re outage-ready (including cloud dependencies and vendor-executed changes), vCSO.ai can run a Board Outage Readiness Diagnostic and deliver a decision memo with a 90-day control plan.
