Phase 1 of 6
Scoping & Latency
Define the channels, time-to-first-token budget, seat topology, and language coverage that will govern every RAG and LLM decision downstream.
Channels & Conversation Surface
Identify channels in scope for agent assist
Why This Matters
Voice, chat, and video have materially different latency envelopes, transcription dependencies, and compliance obligations — voice triggers state two-party consent laws and MiFID II recordkeeping, while chat is text-native and avoids voice biometric exposure (Illinois BIPA). Bundling channels into one copilot without scoping each explicitly is how teams discover six months in that their voice pipeline cannot reuse the chat RAG stack because ASR latency alone consumes the entire TTFT budget.
Note prompts:
+ Which channels are live today vs. planned in the next 12 months?
+ Do we own the transcription stack end-to-end or is it outsourced to the CCaaS vendor?
+ Is our video banking channel in scope for the same knowledge base as voice?
Confirm which customer-facing channels the copilot must support.
Select all that apply
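A minimal scoping sketch, assuming placeholder budgets and compliance tags (none of the numbers or tags below are measured values or legal determinations): one explicit entry per surface forces the channel-by-channel decision this item asks for, instead of bundling channels into one copilot.

```python
# Channel scoping matrix. All budgets and compliance tags below are
# illustrative placeholders, not measured values or legal determinations.
CHANNELS = {
    "voice": {
        "ttft_budget_ms": 500,             # must also absorb streaming ASR latency
        "transcription": "streaming_asr",
        "compliance": ["two-party consent", "MiFID II recordkeeping"],
    },
    "chat": {
        "ttft_budget_ms": 1000,            # text-native: no ASR stage in the path
        "transcription": None,
        "compliance": [],                  # avoids voice biometric exposure (BIPA)
    },
    "video": {
        "ttft_budget_ms": 500,
        "transcription": "streaming_asr",
        "compliance": ["two-party consent", "MiFID II recordkeeping"],
    },
}

def scoped(channel: str) -> dict:
    """Fail fast on any surface the team never explicitly scoped."""
    if channel not in CHANNELS:
        raise ValueError(f"channel {channel!r} was never scoped in Phase 1")
    return CHANNELS[channel]
```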
Define time-to-first-token (TTFT) latency budget
Why This Matters
Sub-500ms TTFT is the threshold at which a suggestion lands before the agent finishes the customer's sentence — anything slower forces the agent to wait through an awkward pause or talk over a stale suggestion that no longer matches the conversation turn. Industry telemetry from 2024–2025 shows 68% of financial services agent-assist deployments stuck above a 2-second P95, which is why most of them are used for after-call wrap rather than in-conversation. Setting the TTFT budget is a first-order architectural decision — infrastructure choices made after the SLA is fixed have an order of magnitude less leverage than choices that set it correctly.
Note prompts:
+ What is our current P95 TTFT in our pilot deployment, and which stage dominates?
+ Have we measured agent abandonment of suggestions as a function of TTFT?
+ Is our TTFT target the same across voice, chat, and video, or tiered by channel?
Select the P95 TTFT the agent copilot must hold during a live conversation.
Single choice
Trinidy — Trinidy's optimized RAG pipeline — embedding 20ms, ANN retrieval 80ms, rerank 50ms, prompt build 50ms, first token 250ms — completes in under 450ms on-prem. Cloud-routed LLM calls alone consume 200–800ms of network and queue time before the model starts generating.
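The arithmetic behind that pipeline claim, as a sketch: summing the stage figures quoted above against the 500ms threshold named earlier in this phase. The 200ms overhead used below is the low end of the cloud range cited above.

```python
# Stage latencies (ms) taken from the on-prem pipeline described above;
# the 200-800ms figure is the cloud network/queue overhead cited there.
ON_PREM_STAGES_MS = {
    "embedding": 20,
    "ann_retrieval": 80,
    "rerank": 50,
    "prompt_build": 50,
    "first_token": 250,
}

def ttft_ms(stages: dict, extra_overhead_ms: int = 0) -> int:
    # TTFT is the serial sum of every stage before the first token appears.
    return sum(stages.values()) + extra_overhead_ms

BUDGET_MS = 500
print(ttft_ms(ON_PREM_STAGES_MS))        # 450 -- inside a 500ms budget
print(ttft_ms(ON_PREM_STAGES_MS, 200))   # 650 -- cloud best case already over
```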
Quantify daily call volume and concurrency
Why This Matters
Daily call volume is the wrong capacity-planning anchor: peak concurrent conversations determine the GPU fleet, because each active agent holds a streaming LLM session. Bank of America's Erica handled 676M interactions in 2024, with concurrency peaks far above the load implied by the daily average. Sizing the fleet to the daily average produces a peak-hour queue that blows through the TTFT budget before any model is at fault.
Note prompts:
+ What is our peak concurrent call count today and how does it scale in the next 24 months?
+ Are we sizing the LLM fleet on daily volume or measured peak concurrency?
+ What is our fallback behavior when concurrency exceeds provisioned capacity?
Capacity planning anchor for the LLM serving fleet: peak concurrency drives GPU count, not daily volume.
Single choice
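A back-of-envelope sketch of why the daily average misleads, using Little's Law (concurrent sessions equal arrival rate times average handle time); every input below is an illustrative placeholder, not a measured figure.

```python
# Little's Law: concurrent sessions = arrival rate x average handle time.
# All inputs are illustrative placeholders -- substitute measured values.
daily_calls = 200_000
peak_hour_share = 0.12          # fraction of daily volume in the busiest hour
avg_handle_time_s = 420         # 7-minute average handle time

peak_concurrent = (daily_calls * peak_hour_share / 3600) * avg_handle_time_s
avg_concurrent = (daily_calls / 86_400) * avg_handle_time_s

print(round(peak_concurrent))   # ~2800 streaming sessions at peak hour
print(round(avg_concurrent))    # ~972 -- daily-average sizing misses ~2/3 of peak
```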
Specify concurrent agent seat count
Why This Matters
Seat count multiplied by average streaming-session duration determines the number of concurrent LLM streams the inference fleet must support — and LLM serving is concurrency-bound far more than throughput-bound. A 10,000-seat contact center at 80% utilization typically holds 4,000–6,000 simultaneous streaming LLM sessions during peak hour, which maps directly to GPU count. Undersizing for concurrency is the single most common cause of production TTFT regressions.
Note prompts:
+ What is our measured peak concurrent streaming session count today?
+ Is our GPU fleet sized for peak concurrency or rolling average?
+ Do we have headroom for seasonal peaks (tax season, holiday retail)?
Number of simultaneously active agent seats that must hold sub-500ms TTFT under peak load.
Single choice
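A sizing sketch under stated assumptions: streams per seat, per-GPU concurrency, and headroom are placeholders to replace with load-test results; only the 10,000-seat, 80%-utilization anchor and the resulting stream count (inside the 4,000-6,000 range above) come from the figures in this item.

```python
import math

# Seats -> concurrent LLM streams -> GPU count. Every ratio below is an
# assumption to replace with your own load-test results.
seats = 10_000
peak_utilization = 0.80     # share of seats in an active conversation at peak
streams_per_seat = 0.65     # share of active calls holding a live LLM stream

concurrent_streams = int(seats * peak_utilization * streams_per_seat)  # 5200

streams_per_gpu = 32        # measured per-GPU concurrency at the target TTFT
headroom = 1.25             # seasonal buffer (tax season, holiday retail)

gpus = math.ceil(concurrent_streams * headroom / streams_per_gpu)
print(concurrent_streams, gpus)   # 5200 streams -> 204 GPUs
```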
Define language and dialect coverage
Why This Matters
Wells Fargo's Fargo upgrade to Gemini 2.0 Flash drove Spanish-version adoption to 80% — the Spanish-speaking segment is the single largest non-English bloc in US retail banking and is frequently underserved by English-first copilots that translate on the fly. Language coverage also cascades into PII masking (entity extractors are language-specific), compliance-tagged response libraries (Reg E disclosures must be delivered in the language of the conversation), and embedding models (multilingual embeddings sacrifice some retrieval accuracy vs. language-specific ones).
Note prompts:
+ What percentage of our inbound volume is non-English, and is our copilot viable in those languages today?
+ Do we have compliance-approved Reg E / TILA disclosures in every supported language?
+ Is our PII masking model language-aware, or are we leaking PII in non-English transcripts?
Which languages the copilot must support in RAG retrieval, LLM generation, and PII masking.
Select all that apply
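One way the PII-masking cascade becomes concrete: a hypothetical router that refuses to process a transcript when no language-specific extractor exists. The model identifiers and the run_extractor helper are placeholders, not a real library API.

```python
# Hypothetical language-aware PII masking router. Model names and the
# run_extractor helper are placeholders, not a real library API.
PII_EXTRACTORS = {
    "en": "pii-ner-en",
    "es": "pii-ner-es",
}

def run_extractor(model: str, text: str) -> str:
    raise NotImplementedError("stand-in for the actual NER + redaction step")

def mask_pii(transcript: str, lang: str) -> str:
    model = PII_EXTRACTORS.get(lang)
    if model is None:
        # Refuse rather than silently applying the English extractor to a
        # non-English transcript and leaking account numbers or SSNs.
        raise ValueError(f"no compliance-approved PII extractor for {lang!r}")
    return run_extractor(model, transcript)
```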
Confirm data residency and cross-border constraints
Map conversation, customer, and 1033 open-banking data to jurisdictional constraints before architecture is finalized.
Select all that apply
Trinidy — Conversation transcripts containing SSN, account numbers, and 1033 open-banking payloads cannot transit public LLM APIs without triggering GLBA and potential CFPB scrutiny. Trinidy keeps ASR, RAG retrieval, and LLM inference entirely inside the institution's perimeter — no customer conversation data leaves the network boundary.
Define deployment topology for inference
Select the physical / logical deployment target for the LLM serving fleet.
Single choice
Trinidy — For sub-500ms TTFT plus GLBA-compliant residency, public cloud LLM APIs are physically and regulatorily marginal. Trinidy runs the full agentic RAG + LLM stack on-prem with GPU or CPU targets and deterministic egress-free inference.
Scope agent workflow integration surface
Which systems the copilot must read from and write into during a live call.
Select all that apply
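A hypothetical contract that pins down what "read from and write into" means at the code level; the system roles, method names, and signatures below are illustrative assumptions, not an existing vendor API.

```python
# Hypothetical contract for the copilot's integration surface. System
# roles and signatures are illustrative, not an existing vendor API.
from typing import Protocol

class ReadSurface(Protocol):
    def customer_context(self, customer_id: str) -> dict: ...    # e.g. CRM profile
    def knowledge_search(self, query: str) -> list: ...          # e.g. KB articles

class WriteSurface(Protocol):
    def post_wrap_up(self, call_id: str, summary: str) -> None: ...   # CRM notes
    def open_case(self, call_id: str, disposition: str) -> None: ...  # ticketing
```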