Phase 1 of 6
Scoping & Latency Constraints
Define the channels, time-to-first-token budget, language coverage, PII surface, and regulatory footprint that will govern every architectural decision for the conversational AI stack.
Channels & Interaction Surface
Identify channels the assistant must serve
Why This Matters
Channel selection materially changes the latency envelope, the PII surface, and the hallucination risk profile. Voice channels demand sub-800ms time-to-first-token to preserve conversational cadence, while in-app chat tolerates 1–2 seconds before users perceive lag. Bank of America's Erica runs across mobile, voice, and web with a 48-second average interaction — a design point only reachable when channel-specific budgets are set explicitly. The most common mistake is treating every channel as a single assistant; the latency and compliance envelope differ by channel, not by use case.
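The per-channel envelopes above can be captured as an explicit budget table so no channel silently inherits another's constraints. A minimal sketch, assuming hypothetical channel names and illustrative figures taken from the text (sub-800ms for voice, 1–2s for in-app chat):

```python
# Hypothetical per-channel budget table; channel names and the "pii_surface"
# labels are illustrative assumptions, not a standard taxonomy.
CHANNEL_BUDGETS = {
    "voice":       {"ttft_ms_p95": 800,  "pii_surface": "high",   "human_handoff": True},
    "in_app_chat": {"ttft_ms_p95": 1500, "pii_surface": "high",   "human_handoff": True},
    "web_chat":    {"ttft_ms_p95": 2000, "pii_surface": "medium", "human_handoff": True},
}

def budget_for(channel: str) -> dict:
    """Look up the latency/compliance envelope for a channel; fail loudly on unknown surfaces."""
    try:
        return CHANNEL_BUDGETS[channel]
    except KeyError:
        raise ValueError(f"no budget defined for channel {channel!r}")
```

Failing loudly on an unknown channel enforces the point above: every surface gets its own explicit budget before it goes live, rather than defaulting to a shared one.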
Note prompts
+ Which channels share enough context to justify a single assistant versus channel-specific tuning?
+ Have we inventoried the p95 latency budget for each channel before selecting a model?
+ Who owns the channel-by-channel handoff to a human agent when the assistant cannot resolve?

Confirm every surface on which the chatbot or virtual assistant will answer customer queries.
Select all that apply
Define time-to-first-token (TTFT) SLA by channel
Why This Matters
Perceived latency in conversational AI is dominated by time-to-first-token, not end-to-end generation time — users judge the assistant as "fast" based on how quickly it starts responding. TensorRT-LLM on H100 delivers ~100ms TTFT at 64 concurrent requests, and vLLM v0.6.0 cut TTFT by 5× versus prior releases — the serving substrate now matters as much as the model size. A cache-hit path through semantic caching returns in 5–20ms, while a cache miss requires the full LLM inference — so the effective TTFT is the blended average and hinges on cache hit rate.
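The blended-TTFT point above can be made concrete with a weighted average over the hit and miss paths. A minimal sketch; the 70% hit rate and the 10ms/300ms path latencies in the usage comment are illustrative assumptions, not measurements:

```python
def blended_ttft_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Effective TTFT as the hit-rate-weighted average of cache-hit and cache-miss paths."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# Illustrative: a 70% semantic-cache hit rate, 10ms hits, 300ms misses
# gives 0.7*10 + 0.3*300 = 97ms effective mean TTFT.
```

Note this is a mean, not a p95: at a 70% hit rate the p95 request still lands on the miss path, which is why the prompts above ask for hit and miss paths to be measured separately.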
Note prompts
+ What is our current p95 TTFT by channel, and where is the hot spot — retrieval, serving, or network?
+ Have we measured TTFT separately for cache-hit versus cache-miss paths?
+ What does the assistant do when TTFT is breached — keep streaming, time out, or hand off?

Select the TTFT budget your conversational stack must hold at p95 under peak load.
Single choice
Trinidy — Cloud-routed LLM inference consumes 100–300ms of network round-trip before a single token is produced — often half of the perceived latency budget. Trinidy collocates the serving tier with the semantic cache and RAG retriever, keeping TTFT predictable even during traffic spikes.
Define end-to-end response completion SLA
Specify the p95 full-response latency target distinct from TTFT.
Single choice
Specify language and dialect coverage
Why This Matters
Wells Fargo has publicly reported that Spanish accounts for more than 80% of Fargo's non-English usage — language coverage is not a nice-to-have, it is a primary product decision. Hallucination rates and safety-tuning quality differ materially across languages in frontier models, and many guardrail evaluations are English-only. Shipping a chatbot that is fluent in English and unreliable in Spanish creates a measurable fair-lending exposure in addition to a CX problem.
Note prompts
+ What is our customer base's language distribution, and does our assistant match it?
+ Do we evaluate hallucination and guardrail performance per language, or only in English?
+ Is our RAG corpus available in every supported language, or is non-English a translation-only surface?

Confirm language support, with particular attention to Spanish and other high-volume non-English segments.
Select all that apply
Map the PII surface entering LLM context
Why This Matters
The CFPB June 2023 Issue Spotlight specifically named account numbers, transaction histories, SSNs, beneficiary designations, and health-related financial data (HSAs/FSAs) as categories of sensitive PII that financial chatbots routinely put into LLM context — every one of which triggers GLBA Safeguards Rule and CFPA obligations. Cloud-routed LLM inference means that context becomes a third-party processor relationship, not just an engineering choice. Wells Fargo's solution — voice input locally transcribed, SLM scrubs PII, only anonymized text reaches the external LLM — is the architectural pattern that allowed 11.5× interaction growth without compliance exposure.
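The scrub-before-send pattern described above can be sketched as a local anonymization stage that runs before any text crosses the perimeter. This is a simplified regex stand-in for the SLM scrubber the text describes; a production system would use a trained NER/SLM model, and the patterns here are illustrative assumptions that will miss many PII forms:

```python
import re

# Illustrative patterns only -- regexes alone are not a compliant PII scrubber.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),          # SSN-shaped strings
    (re.compile(r"\b\d{12,19}\b"), "[ACCOUNT_NUMBER]"),       # account/card-length digit runs
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),  # email addresses
]

def scrub(text: str) -> str:
    """Replace detected PII with placeholders so only anonymized text leaves the perimeter."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

The architectural point is the placement, not the patterns: scrubbing happens inside the institution's perimeter, so the external LLM provider only ever receives placeholder tokens.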
Note prompts
+ Which PII categories reach our LLM context today, and which reach a third-party LLM provider?
+ Do we have a local PII scrubbing / anonymization layer, or does raw customer text hit the LLM directly?
+ Have we mapped the LLM provider relationship against GLBA service-provider requirements?

Inventory every PII category that may be placed into the LLM context window.
Select all that apply
Confirm data residency and cross-border constraints
Map conversational context and retrieval corpora to jurisdictional constraints before architecture is finalized.
Select all that apply
Trinidy — GLBA Safeguards, CCPA/CPRA, BIPA (for voice), and EU GDPR all press against cloud-hosted LLM serving. Trinidy keeps the RAG index, PII scrubbing, and audit logging entirely within the institution's perimeter — no cross-border flow of customer dialogue for any interaction.
Define scope of consumer-facing statutory rights handling
Why This Matters
The CFPB August 2024 guidance explicitly stated that AI chatbot errors that fail to recognize a consumer's invocation of statutory rights — Reg E dispute notices being the canonical example — may constitute UDAAP violations, with no "AI error" defense available. Reg E starts a regulatory clock (10 business days to investigate, 45 days to resolve); a chatbot that confidently answers a dispute question without triggering the Reg E process has created a regulatory liability, not an ops issue. Statutory-rights recognition must be a first-class routing decision, not a side effect of intent classification.
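"First-class routing decision" can be sketched as a deterministic gate that runs before any LLM generation: if a protected statutory notice is detected, the utterance is routed to a compliant workflow instead of a free-form answer. The intent names and trigger phrases below are illustrative assumptions; a real system would use a trained classifier with these intents as explicit labels:

```python
# Hypothetical pre-LLM gate; phrase lists are illustrative, not exhaustive.
STATUTORY_INTENTS = {
    "reg_e_dispute":       ["unauthorized transaction", "dispute this charge",
                            "didn't make this payment"],
    "reg_z_billing_error": ["billing error", "wrong amount on my statement"],
    "fcra_dispute":        ["credit report error", "dispute my credit report"],
}

def route(utterance: str) -> str:
    """Return a statutory workflow name if a protected right is invoked, else 'llm_answer'."""
    lowered = utterance.lower()
    for intent, phrases in STATUTORY_INTENTS.items():
        if any(phrase in lowered for phrase in phrases):
            # A detection here should also be timestamped and logged for regulatory
            # audit, per the prompts below -- it starts the Reg E clock.
            return intent
    return "llm_answer"
```

The design choice is that the gate is deterministic and runs first: a confident LLM answer can never pre-empt the statutory workflow, which is exactly the failure mode the guidance above describes.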
Note prompts
+ Does our intent classifier have first-class intents for every protected statutory notice a consumer might give?
+ When a statutory right is invoked, does the assistant hand off to a compliant workflow rather than attempting to answer?
+ Do we log the moment a statutory-rights intent was detected for regulatory audit?

Specify how the assistant recognizes and routes consumer invocations of statutory rights (Reg E dispute, Reg Z billing error, FCRA, etc.).
Select all that apply
Specify deployment topology for the serving plane
Select the physical/logical deployment target for the LLM serving tier and the RAG retriever.
Single choice
Trinidy — For PII residency and sub-second TTFT, cloud-API-only serving is economically and regulatorily fragile at Wells Fargo / BofA scale. Trinidy provides on-prem vLLM / TensorRT-LLM serving with the semantic cache and RAG index collocated — the entire hot path stays inside the institution's perimeter.