Phase 1 of 6
Scoping & Autonomy / Rollback Constraints
Define the fault domains in scope, the remediation latency budget, the autonomy level you are prepared to operate at, and the rollback guarantees every downstream architectural decision must respect.
Network Domains & Fault Surface
Identify network domains in scope for fault prediction and self-healing
Why This Matters
Fault signatures and remediation primitives differ sharply across RAN, transport, core, and site-power domains, and a single closed-loop model rarely transfers cleanly between them. O-RAN disaggregation in particular introduces RU/DU/CU fault modes that classical vendor-integrated SON models were never trained on. Inventorying every domain up front prevents the common failure mode of shipping a RAN-only self-healing model and having ops discover six months later that 40% of real outages originate in transport or power.
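As a concrete way to hold that inventory, here is a minimal sketch (Python; the domain names, telemetry sources, and action primitives are hypothetical, with real entries coming from the NOC fault-surface audit) of a per-domain registry that makes the shared-model-versus-dedicated-head question explicit per domain:

```python
from dataclasses import dataclass

@dataclass
class FaultDomain:
    """One fault surface in scope, with the telemetry and actions it exposes."""
    name: str
    telemetry_sources: list[str]       # streams/counters this domain emits
    action_primitives: list[str]       # remediations the loop may invoke here
    dedicated_model_head: bool = True  # False => candidate for a shared model

# Hypothetical inventory entries, illustration only.
DOMAINS = [
    FaultDomain("ran_5g_nr", ["gNB PM counters", "RRC traces"],
                ["cell reset", "carrier lock-out"]),
    FaultDomain("transport", ["IP link state", "microwave RSL"],
                ["path reroute"]),
    FaultDomain("site_power", ["rectifier alarms", "battery voltage"],
                ["generator start"], dedicated_model_head=False),
]

def shared_model_candidates(domains: list[FaultDomain]) -> list[str]:
    """Domains flagged to share a model rather than carry a dedicated head."""
    return [d.name for d in domains if not d.dedicated_model_head]
```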
Note prompts
+ Which domains share enough telemetry and action primitives to justify a shared model versus dedicated per-domain heads?
+ Have we inventoried every fault surface the NOC currently touches, including site power and HVAC?
+ Who owns the fault-to-domain attribution so we can measure model ROI per domain?
Required
Confirm which domains the closed-loop model must observe and remediate.
Select all that apply
Radio Access Network — 4G LTE eNodeB
Radio Access Network — 5G NR gNodeB (SA / NSA)
O-RAN disaggregated RU / DU / CU
Transport / backhaul / fronthaul (IP, microwave, fiber)
Baseband / BBU pool
Site power, rectifiers, batteries, HVAC
Core network (5GC / EPC) fault surfaces
Tower-level environmental and structural telemetry
Define end-to-end remediation latency budget
Why This Matters
The O-RAN Alliance defines three control-loop tiers — Non-RT RIC (>1s), Near-RT RIC (10ms–1s), and real-time RAN control (<10ms) — and placing a closed-loop remediation on the wrong tier is the most common architecture error. A sub-second remediation pipeline running on a Non-RT RIC cadence will consistently miss the cascade window, while pushing a Non-RT workload into the Near-RT RIC bus wastes its xApp budget. Latency decisions made after the pipeline is wired have 10× less leverage than latency decisions made before site topology is chosen.
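A minimal sketch of that placement rule (Python; the tier bounds are the O-RAN figures above, while the fault-class names and budgets are hypothetical): route each remediation to the slowest tier that still closes inside its latency budget.

```python
def control_loop_tier(budget_ms: float) -> str:
    """Map a detect->act->verify latency budget to the O-RAN loop tier."""
    if budget_ms < 10:
        return "real-time RAN control (<10ms)"
    if budget_ms <= 1_000:
        return "Near-RT RIC (10ms-1s)"
    return "Non-RT RIC (>1s)"

# Illustrative budgets only; derive real ones from measured cascade windows.
for fault_class, budget_ms in {"rf_interference": 200, "kpi_drift": 30_000}.items():
    print(f"{fault_class}: {control_loop_tier(budget_ms)}")
```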
Note prompts
+ Which fault classes actually need <500ms remediation versus a 10-second Non-RT loop?
+ Have we measured our current detection-to-action latency end to end, including NOC ticketing overhead?
+ What is our fallback behavior when the loop breaches its latency budget: no-op, escalate, or revert?
Required
Select the target latency for the full detect → classify → act → verify loop.
Single choice
< 100ms (URLLC / 5G MEC adjacent faults)
< 500ms (site-level self-healing target)
< 5s (fast Non-RT RIC loop)
< 60s (Non-RT RIC / SMO control loop)
Tiered by severity (mixed SLA)
Trinidy: Cloud-routed fault inference alone consumes 50–200ms of network round-trip before a score is computed, often past the point where the fault has already cascaded. Trinidy runs the full three-stage pipeline on-node with sub-500ms end-to-end remediation, surviving backhaul degradation.
Select target autonomy level on the TMF L0–L5 scale
Why This Matters
TMF IG1230 defines the autonomous-network maturity levels L0–L5, and most operators in production today sit between L2 and L3 — the model triages, proposes, and sometimes executes pre-approved runbooks. Overclaiming autonomy is a well-documented failure mode: a team targets L4 on paper, deploys at L2 in reality, and operates with no clear contract for what the model is actually allowed to do alone. The level you target should be driven by the regulatory, safety, and rollback posture, not by ML ambition.
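One way to make that contract explicit is a per-action-class approval gate. A minimal sketch follows (Python; the action classes are hypothetical), encoding the reading of the levels above: L2 means a human approves every remediation, L3 means a human is pulled in only for exceptions.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """TMF IG1230 autonomous-network maturity levels."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5

# Hypothetical contract: the honestly measured level per action class,
# not the program's stated aspiration.
ACTION_CLASS_LEVEL = {
    "cell_reset": AutonomyLevel.L3,
    "spectrum_reallocation": AutonomyLevel.L2,
}

def needs_human_approval(action_class: str, is_exception: bool) -> bool:
    """L2 and below always escalate; L3 escalates exceptions; L4+ runs alone."""
    level = ACTION_CLASS_LEVEL.get(action_class, AutonomyLevel.L1)  # conservative default
    if level <= AutonomyLevel.L2:
        return True
    if level == AutonomyLevel.L3:
        return is_exception
    return False
```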
Note prompts
+ What TMF autonomy level is each of our action classes actually operating at today, honestly measured?
+ What is the gap between our stated aspiration (often L4) and our contractual remediation scope?
+ Who signs off on raising the autonomy level for a given action class: network engineering, SRE, or compliance?
Required
Pick the autonomous-network maturity level that governs which decisions the model may make unsupervised.
Single choice
L1 — Assisted operations (human in every decision)
L2 — Partial autonomy (human approves remediations)
L3 — Conditional autonomy (human in loop for exceptions)
L4 — High autonomy (human on escalation only)
L5 — Full autonomy (aspirational / research only)
Define acceptable auto-remediation rate and escalation rate
Why This Matters
Well-tuned closed-loop SON systems (Nokia AVA, Ericsson Intelligent Automation Platform) typically land in the 60–80% auto-remediation band for routine fault classes, with 20–40% escalating to NOC. Under-automating leaves the NOC labor savings on the table; over-automating takes the human out of the loop for action classes where a bad decision can cascade across many cells. Framing the program around an auto-remediation budget per action class rather than a single number is the most direct way to match risk appetite to ML deployment.
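A minimal sketch of that framing (Python; the action classes and budget fractions are hypothetical): track an auto-remediation budget per action class and force escalation once a class runs over it.

```python
# Max fraction of each class the model may close without a human.
AUTO_BUDGET = {"cell_reset": 0.80, "power_failover": 0.40}
counts = {cls: {"auto": 0, "total": 0} for cls in AUTO_BUDGET}

def may_auto_remediate(action_class: str) -> bool:
    """True while the class is still under its auto-remediation budget."""
    c = counts[action_class]
    return c["total"] == 0 or c["auto"] / c["total"] < AUTO_BUDGET[action_class]

def record_outcome(action_class: str, auto_closed: bool) -> None:
    counts[action_class]["total"] += 1
    counts[action_class]["auto"] += int(auto_closed)
```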
Note prompts
+ What percentage of routine fault tickets last quarter could have been auto-remediated safely in hindsight?
+ Is our auto-remediation rate tracked per action class, or as a single aggregate?
+ Who owns the P&L line for NOC labor saved versus SLA breach cost from a bad auto-action?
Required
Quantify how much of routine fault volume the model is permitted to close without a human.
Single choice
< 40% auto-remediate (conservative — mostly escalation)
40% – 60% auto-remediate
60% – 80% auto-remediate (typical well-tuned SON)
> 80% auto-remediate (aggressive L3+ deployment)
Not currently budgeted at the action-class level
Establish MTTR reduction target versus current baseline
Why This Matters
Nokia Networks has publicly reported MTTR reductions from a 4.2-hour baseline to under 4 minutes on AVA closed-loop SON, and T-Mobile US reported a 70% reduction in customer-impacting events across six consecutive quarters on Ericsson Intelligent Automation Platform. These are public benchmarks, not ceilings — but they are the honest goalposts any program should calibrate against. An MTTR target without a current-baseline measurement is a slogan, not a commitment.
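The underlying arithmetic is simple; a minimal sketch follows (Python, with synthetic ticket durations) of per-fault-class p50/p95 MTTR and the reduction-versus-baseline computation:

```python
import statistics

def mttr_p50_p95(durations_min: list[float]) -> tuple[float, float]:
    """p50 and p95 MTTR for one fault class, in minutes."""
    cuts = statistics.quantiles(durations_min, n=100)
    return statistics.median(durations_min), cuts[94]

def reduction(baseline_min: float, current_min: float) -> float:
    return (baseline_min - current_min) / baseline_min

p50, p95 = mttr_p50_p95([12, 18, 25, 40, 240])  # synthetic durations
# The Nokia AVA figures quoted above: 4.2h down to 4min is a ~98.4% reduction.
print(f"{reduction(4.2 * 60, 4):.1%}")  # -> 98.4%
```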
Note prompts
+ What is our current p50 and p95 MTTR by fault class, and how was it measured?
+ Which peer operator benchmark (Nokia AVA, Ericsson IAP, Huawei iMaster) is most comparable to our fleet?
+ Do we have a per-fault-class MTTR dashboard the board sees, or only an aggregate ops metric?
Required
Specify the MTTR improvement the program commits to, benchmarked against today.
Single choice
50% MTTR reduction target
75% MTTR reduction target
90%+ MTTR reduction (Nokia AVA-class deployment)
No hard MTTR target — measure and improve
Not yet measured at the fault-class level
Map FCC NORS / DIRS outage-reporting obligations into the remediation flow
Why This Matters
FCC Part 4 (Network Outage Reporting System) requires US communications providers to report outages that last at least 30 minutes and affect 900,000 or more user-minutes, plus special-office and 911 outages on shorter thresholds; FCC DIRS is the separate disaster-information reporting regime activated during hurricanes and major events. A closed-loop system that auto-remediates can mask a reportable outage if the logging and duration accounting is not wired through the remediation flow. Regulators have been explicit that automation does not dissolve reporting obligations.
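A minimal sketch of the general-outage test described above (Python; the special-office and 911 thresholds are separate and not modeled here). The key wiring point: duration must be the raw fault duration, not the post-remediation residual.

```python
def nors_general_outage_reportable(users_affected: int,
                                   raw_outage_minutes: float) -> bool:
    """FCC Part 4 general test: >=30 minutes AND >=900,000 user-minutes."""
    user_minutes = users_affected * raw_outage_minutes
    return raw_outage_minutes >= 30 and user_minutes >= 900_000

# 30,000 users for 30 minutes hits exactly 900,000 user-minutes: reportable
# even if auto-remediation collapsed the visible impact to seconds.
assert nors_general_outage_reportable(30_000, 30)
```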
Note prompts
+ Does our closed-loop logging capture the raw outage duration even when remediation collapses it to seconds?
+ Which auto-action classes could, if they failed silently, conceal a NORS-reportable outage?
+ Who on the regulatory team has signed off on our NORS flow interaction with auto-remediation?
Required
Confirm which auto-remediation outcomes trigger FCC Part 4 outage reporting and ensure the model flow respects the obligation.
Select all that apply
FCC NORS — 30-minute outages affecting 900k+ user-minutes
FCC NORS — airport / 911 / special-office outages
FCC DIRS — active hurricane / disaster reporting
State PUC outage reporting overlay
International regulatory outage reporting (CRTC, Ofcom, BNetzA, ACMA)
No reportable outages in scope
Define rollback guarantee for every auto-remediation class
Why This Matters
ETSI GS ZSM 002 and the O-RAN WG2 Non-RT RIC architecture both treat reversibility as a first-class property of closed-loop operations: every automated action must have a defined, tested rollback path. A program that can auto-apply but cannot auto-revert effectively concentrates tail risk — one bad decision becomes a multi-hour incident because the reversal path is undefined. The time-to-revert is usually a more honest proxy for deployment maturity than the auto-remediation rate.
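A minimal sketch of reversibility as a first-class property (Python; the apply/revert callables and the guarantee value are hypothetical): every action carries its rollback and its revert-time guarantee, and every revert is timed so a p95 can be reported per action class.

```python
import time

class ReversibleAction:
    """Pairs a forward remediation with a tested rollback and a revert SLA."""

    def __init__(self, name: str, apply_fn, revert_fn, max_revert_s: float):
        self.name = name
        self.apply_fn = apply_fn
        self.revert_fn = revert_fn
        self.max_revert_s = max_revert_s
        self.revert_samples: list[float] = []  # feed a p95 revert dashboard

    def apply(self) -> None:
        self.apply_fn()

    def revert(self) -> None:
        start = time.monotonic()
        self.revert_fn()
        elapsed = time.monotonic() - start
        self.revert_samples.append(elapsed)
        if elapsed > self.max_revert_s:
            raise RuntimeError(
                f"{self.name}: revert took {elapsed:.3f}s, "
                f"breaching the {self.max_revert_s}s guarantee")

# Hypothetical usage: a cell reset with a 10-second revert guarantee.
action = ReversibleAction("cell_reset", lambda: None, lambda: None, 10.0)
action.apply()
action.revert()
```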
Note prompts
+ For every action class we have in production, is there a tested rollback with a measured p95 revert time?
+ What fraction of our action classes have no rollback path today, and why?
+ Has rollback been drilled under network stress, or only in clean lab conditions?
Required
Specify the maximum time to revert an auto-action and the conditions that force reversion.
Single choice
< 1s rollback (atomic config revert)
< 10s rollback (Near-RT RIC action revert)
< 60s rollback (Non-RT RIC / SMO-mediated)
< 5min rollback (NOC-assisted)
Rollback is best-effort / not guaranteed
Trinidy: Rollback must run inside the same on-node control loop that applied the action; a cloud-routed rollback inherits the original latency problem in reverse. Trinidy keeps the forward action and its reversal on the same site-resident runtime.
Confirm deployment topology for the inference plane
Required
Select the physical and logical deployment target for the closed-loop pipeline.
Single choice
Site-resident edge (cell-site router / DU sleeve)
Regional aggregation point (metro / MEC)
Central Non-RT RIC / SMO cluster
Operator private cloud / VPC in-region
Public cloud managed inference
Hybrid — site-edge inference + central training
Trinidy: For sub-500ms remediation with backhaul-tolerant survivability, cloud inference is physically incompatible. Trinidy is the on-site inference substrate: site-resident for RAN fault classes, regional-aggregation-resident for cross-site correlation, both on the same deployment fabric.
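The arithmetic behind that claim, as a minimal sketch (Python; the per-stage budgets are illustrative, and the transit figure is the mid-range of the 50–200ms cost quoted earlier):

```python
BUDGET_MS = 500
on_node_stages_ms = {"detect": 50, "classify": 120, "act": 150, "verify": 100}
cloud_round_trip_ms = 150  # mid-range of the 50-200ms transit cost

on_node_total = sum(on_node_stages_ms.values())           # 420ms
print(on_node_total <= BUDGET_MS)                         # True: fits on-node
print(on_node_total + cloud_round_trip_ms <= BUDGET_MS)   # False: cloud hop busts it
```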
Confirm data sovereignty and residency constraints for telemetry
Required
Map equipment, subscriber-adjacent, and configuration telemetry to jurisdictional residency requirements.
Select all that apply
EU GDPR — telemetry must remain in EU
UK GDPR — UK residency required
National lawful-intercept data cannot leave country
India / Brazil / China localization rules
Equipment-vendor telemetry-sharing contract limits
No cross-border data flow permitted for any CP/UP telemetry
Cross-border permitted under SCCs / approved vendors
Trinidy: EU GDPR, country-level lawful-intercept rules, and operator-specific equipment telemetry contracts all constrain cloud-hosted inference. Trinidy keeps telemetry, model scoring, and audit logging entirely within the operator's own perimeter: no cross-border data flow for any fault decision.
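A minimal sketch of a residency policy table (Python; the region tags and telemetry classes are hypothetical): each telemetry class maps to the jurisdictions where it may be stored or scored, and the check runs before any fault decision leaves the perimeter.

```python
RESIDENCY = {
    "subscriber_adjacent": {"eu-de", "eu-fr"},    # GDPR: must remain in EU
    "lawful_intercept":    {"in-country"},        # never crosses the border
    "equipment_pm":        {"eu-de", "us-east"},  # per vendor contract terms
}

def placement_allowed(telemetry_class: str, region: str) -> bool:
    """Deny by default: unknown telemetry classes may not be placed anywhere."""
    return region in RESIDENCY.get(telemetry_class, set())

assert placement_allowed("lawful_intercept", "in-country")
assert not placement_allowed("subscriber_adjacent", "us-east")
```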