The Consulting Mindset — Phase 3 Reference

Five frameworks. Worked examples from real AI/data deployments. Copy-paste templates.

MECE — Mutually Exclusive, Collectively Exhaustive

MECE is how you turn a vague mess of work into a structured plan where every task belongs in exactly one bucket, and all buckets together cover the entire problem. It was popularized by McKinsey, but its value for data/AI work is concrete: a MECE breakdown prevents both duplicated effort (two engineers building the same thing from different angles) and invisible gaps (entire categories of work no one owns).

The Test

Mutually Exclusive: Could a single task live in two buckets? If yes, your categories overlap. Overlap creates confusion about ownership and scope.

Collectively Exhaustive: If you completed every task in every bucket, would the entire problem be solved? If not, something is missing from your breakdown.


Worked Example — "Build an AI agent for our ops team"

A client CTO says: "We want AI to make our operations team more efficient." That statement is not a project plan. A non-MECE engineer turns it into a list of tasks that overlap and leave gaps. A MECE breakdown turns it into a tree where every leaf is actionable and ownable.

Non-MECE — overlapping, ungrouped, gaps invisible
Tasks someone might write down:
• Connect to the database
• Build the chat interface
• Write prompts
• Handle user authentication
• Test the AI responses
• Set up BigQuery
• Deploy to GKE
• Train the model
• Write documentation

Problems: "Connect to the database" and "Set up BigQuery" overlap. "Monitoring" and "alerting" are missing entirely. "Train the model" is probably wrong (you're fine-tuning, not training). No one knows which tasks are blocked by which others.
MECE — four exclusive buckets, nothing missing
AI Ops Agent — complete problem decomposition
1. Data Layer GCS Bronze ingestion · BQ Silver/Gold models · Pub/Sub streaming · dbt pipeline · data quality tests
2. Agent Layer ADK agent definition · tools (BQ query, GCS read) · prompts · RAG pipeline · Vertex AI Search grounding
3. Platform Layer GKE deployment · IAM + Workload Identity · VPC networking · CI/CD pipeline · Artifact Registry
4. Validation Layer Eval harness (golden dataset) · AutoSxS pairwise eval · Vertex AI monitoring · UAT with 5 ops users · Day 2 runbook

Why this works: Each task belongs in exactly one bucket. A new task that arrives mid-project slots unambiguously into one category — you immediately know who owns it and whether it was already scoped. The four buckets together cover the entire problem: no data, no agent. No platform, no deployment. No validation, no client sign-off.
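The MECE test can even be checked mechanically once the breakdown is written down. A minimal sketch, with invented bucket and task names: flag any task that appears in more than one bucket (exclusivity broken) and any required task that appears in none (exhaustiveness broken).

```python
# Hypothetical sketch: mechanically checking a task breakdown for MECE
# violations. All bucket and task names here are illustrative, not a real plan.

buckets = {
    "Data":       {"gcs bronze ingestion", "bq silver models", "dbt pipeline"},
    "Agent":      {"adk agent definition", "prompts", "rag pipeline"},
    "Platform":   {"gke deployment", "iam setup", "ci/cd pipeline"},
    "Validation": {"eval harness", "uat sessions", "day 2 runbook"},
}

def mece_violations(buckets, required_tasks):
    """Return (overlaps, gaps): tasks owned by >1 bucket, and required tasks owned by none."""
    owner = {}
    overlaps = set()
    for name, tasks in buckets.items():
        for task in tasks:
            if task in owner:
                overlaps.add(task)           # mutual exclusivity broken
            owner[task] = name
    gaps = set(required_tasks) - set(owner)  # collective exhaustiveness broken
    return overlaps, gaps

overlaps, gaps = mece_violations(buckets, {"prompts", "monitoring dashboards"})
# "prompts" has exactly one owner; "monitoring dashboards" is a gap no bucket covers
```

A new mid-project request gets the same treatment: if it maps to no bucket, it is new scope by definition.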

MECE for Scope Creep Defense

MECE isn't just a planning tool — it's a scope management tool. When a client says "can you also add a Slack integration?", your MECE breakdown immediately answers the question: which bucket does this belong to, and was it in the original decomposition? If it's not in any bucket, it's new scope. If it is in a bucket, check whether the original estimate for that bucket accounted for it.

The Scope Creep Script

"Great idea. That would live in our Platform Layer. Looking at our current scope for that bucket, we have [X, Y, Z] committed. A Slack integration would add approximately [N] days. Should we deprioritize something in the existing scope to fit it in, or add it to the backlog for Phase 2?"

This response is non-defensive, constructive, and forces the client to make a tradeoff decision rather than just adding work to your plate.


The 80/20 MECE Prioritization

Once you have a MECE breakdown, the next question is ordering. The FDE doc calls this 80/20 Value Scoping: identify the 20% of tasks that will deliver 80% of the client's value, and build those first. The method:

Score each task on two axes

Value: How much does this move the client's success metric? (High / Medium / Low)

Effort: How long does this take? (High / Medium / Low)

Build High Value + Low Effort tasks first. These are your quick wins. They build trust and buy time for the harder work.
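The scoring pass above is just a two-key sort. A small illustrative sketch (task names invented) that ranks a backlog so High Value + Low Effort quick wins float to the top:

```python
# Illustrative 80/20 scoring: sort descending by value, ascending by effort.
# Tasks and their ratings are invented for the example.

VALUE = {"High": 3, "Medium": 2, "Low": 1}
EFFORT = {"Low": 1, "Medium": 2, "High": 3}

tasks = [
    ("Real-time streaming dashboard", "Low", "High"),
    ("Golden-dataset eval harness",   "High", "Low"),
    ("Gold-layer revenue aggregates", "High", "Medium"),
    ("Custom admin theming",          "Low", "Medium"),
]

# Quick wins first: highest value, then lowest effort.
ranked = sorted(tasks, key=lambda t: (-VALUE[t[1]], EFFORT[t[2]]))

for name, value, effort in ranked:
    print(f"{value:6} value / {effort:6} effort  {name}")
```

The bottom of this list is exactly where gold-plating candidates end up, which is the point of the exercise.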

Gold-plating: the FDE failure mode

"Gold-plating" means building features nobody asked for because they're technically interesting. A MECE breakdown with 80/20 scoring prevents this: if "real-time streaming dashboard" is Low Value to the client but High Effort for you, it drops to the bottom of the list — regardless of how cool it is to build.

The Pyramid Principle

The Pyramid Principle is a communication framework from Barbara Minto (McKinsey). The core idea is simple and counterintuitive: lead with the conclusion, then support it with evidence. Most engineers naturally do the opposite — they build up context, explain their methodology, describe their analysis, and arrive at the conclusion at the end. Executives experience this as frustrating and unclear.

The reason it's called the Pyramid: the conclusion sits at the apex. Below it are the 2-4 key supporting arguments. Below each argument are the detailed facts and data. You present top-down. You justify bottom-up.


The Structure

1. Governing Thought (the answer) — lead with this. Always.
2. Key Line (2-4 supporting arguments) — the "why" behind the answer.
3. Supporting detail / data / evidence for each argument — only if they ask. Don't volunteer.

BLUF — Bottom Line Up Front

The military variant of the same principle. Every memo, email, and briefing starts with the conclusion: "We recommend X. Here's why." Not: "We analyzed A, B, and C, and after considering D and E, arrived at the conclusion that X may be appropriate."

Apply BLUF to every Slack message, every status email, every discovery readout. The reader immediately knows the conclusion. If they want detail, they read on. If they're busy, they already have what they need.


Worked Example — The Same Message, Two Ways

Scenario: Your BigQuery pipeline is running 6 hours late. You need to tell the client CTO.

Engineer's natural instinct — bottom-up, confusing, no clear ask
"So we were running the dbt transform last night and noticed that the Pub/Sub subscription had a backlog building up from around 2am. We looked into it and it seems like the upstream CRM export changed its schema — there's a new field called `order_category_v2` that broke our Bronze model validation. We've been debugging since 6am and we think we have a fix but we need to test it first, and also we might need you to check with the CRM team about whether this schema change was intentional. The dashboard is currently showing yesterday's data."

The CTO heard: lots of words, something broke, unclear if they need to do anything, unclear when it's fixed, unclear what the business impact is.
Pyramid / BLUF — conclusion first, clear ask, confidence
Subject: Dashboard delay — 6 hours — fix deploying at 2pm | Action needed

"The analytics dashboard is showing yesterday's data due to an upstream schema change in your CRM export. We'll have it current by 2pm today.

Three things to know:
1. Root cause: A new CRM field (`order_category_v2`) broke our ingestion validation at 2am — expected behavior; our circuit breakers caught it before bad data reached the dashboard.
2. Fix: Schema update is tested and deploying now. Dashboard will refresh at 2pm.
3. Action needed: Can you confirm with the CRM team whether `order_category_v2` is permanent? If yes, we'll add it to our Silver model this week."

What changed: Conclusion in subject line. 3 numbered points (not a paragraph). Explicit ask at the end. The CTO reads this in 20 seconds and knows exactly what happened, when it's fixed, and what they need to do.

The Pyramid for Technical Recommendations

When recommending an architecture decision to a technical stakeholder:

Architecture recommendation — Pyramid structure
template
# GOVERNING THOUGHT (lead with this — one sentence)
We should use Pub/Sub + Cloud Run instead of Dataflow for this pipeline.

# KEY LINE (2-4 supporting arguments — parallel structure)
1. Cost: Pub/Sub + Cloud Run is ~$200/month at our volume vs ~$1,400/month for Dataflow
2. Complexity: Dataflow requires managing Apache Beam, increasing on-call burden
3. Fit: Our latency requirement (5 min) doesn't justify Dataflow's streaming capabilities

# SUPPORTING DETAIL (only if asked)
— Volume analysis: 50K events/day × 365 ≈ 18M events/year ≈ 35 Cloud Run invocations/min
— Dataflow minimum: 1 streaming worker running 24/7 ≈ $46/day ≈ $1,400/month
— Cloud Run: 18M requests/year × $0.40/million ≈ $7/year in request charges; with CPU time, total ≈ $200/month at this volume
— Beam complexity: requires Java/Python pipeline code, job monitoring, autoscaling tuning

# PRE-ANSWER OBJECTIONS
"But what if volume grows 10x?" → Cloud Run scales automatically; reassess at 500K events/day
"What about exactly-once delivery?" → Pub/Sub + idempotent subscriber achieves this
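The cost argument in the Key Line is back-of-envelope arithmetic, and it is worth keeping as arithmetic you can rerun when the client's volume changes. A sketch with the unit prices as clearly labeled assumptions (check current GCP pricing before quoting them):

```python
# Back-of-envelope cost comparison for the recommendation above.
# All unit prices are ASSUMPTIONS for illustration — verify against GCP pricing.

EVENTS_PER_DAY = 50_000

# Dataflow: one always-on streaming worker (assumed all-in hourly rate).
DATAFLOW_WORKER_PER_HOUR = 1.9            # assumed $/hr: vCPU + memory + disk
dataflow_monthly = DATAFLOW_WORKER_PER_HOUR * 24 * 30

# Cloud Run: pay per request plus CPU time per invocation (assumed rates).
PRICE_PER_MILLION_REQUESTS = 0.40
CPU_COST_PER_INVOCATION = 0.00013         # assumed: a few hundred ms of CPU
requests_monthly = EVENTS_PER_DAY * 30
cloud_run_monthly = (requests_monthly / 1e6) * PRICE_PER_MILLION_REQUESTS \
                    + requests_monthly * CPU_COST_PER_INVOCATION

print(f"Dataflow : ~${dataflow_monthly:,.0f}/month")
print(f"Cloud Run: ~${cloud_run_monthly:,.0f}/month")
```

Note how the request charge is negligible; at this volume the Cloud Run bill is almost entirely CPU time, which is why the comparison survives moderate growth.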

Pyramid for Discovery Readouts

After your Week 1 site survey, you present findings to the client. This is the highest-stakes Pyramid use case: the client is deciding whether to continue the engagement, expand scope, or both.

Discovery readout — Pyramid structure (15-min slot)
presentation template
Slide 1 — The Answer (spend 2 minutes here)
"Your biggest risk is not the AI — it's the data. We can deploy the agent in 6 weeks, but data quality issues will prevent it from being useful until Week 4. Here's our recommended path."

Slide 2 — Three Supporting Arguments (spend 8 minutes here)
1. Data Risk: 40% of records in the CRM export are missing timestamps
   → Impact: Silver deduplication will fail; Gold will have stale records
   → Fix: 2-day cleanup sprint with the CRM team (we'll provide the queries)
2. Architecture Gap: No streaming pipeline exists; all data is batch
   → Impact: 24-hour dashboard lag vs the "real-time" requirement in the SOW
   → Fix: Pub/Sub + Cloud Run adds 1 week but satisfies the requirement
3. Organizational Risk: No internal "Run Team" identified for Day 2
   → Impact: Agent becomes shelfware when we leave
   → Fix: Identify one internal owner by Week 2; we'll train them by Week 5

Slide 3 — The Ask (spend 5 minutes here)
Three decisions needed from this room today:
1. Approve the 2-day CRM data cleanup (blocks everything else)
2. Confirm streaming vs batch requirement (impacts cost by $800/month)
3. Name the internal Run Team owner by Friday

Note: slides 4-10 are appendix — present only if asked. They contain the technical detail. Most executives never ask to see them. The fact that they exist is enough to establish credibility.

SOW + Minimum Viable Architecture

The Statement of Work is the legally binding fence around your project. The FDE doc defines it simply: "If it's not in the SOW, it's scope creep." But a SOW is only as useful as its precision. A vague SOW ("build an AI agent for operations") is worse than no SOW — it creates false confidence while leaving every disputed boundary unresolved.

The Minimum Viable Architecture (MVA) is the technical complement to the SOW. It answers: what is the simplest possible system that proves value within 30 days? The MVA becomes the deliverable definition in the SOW.


The SOW Anatomy

What goes IN the SOW

Specific deliverables with measurable acceptance criteria. Named integrations. Explicit performance targets (latency, accuracy, uptime). Timeline with milestones. Environments covered. Team responsibilities.

What goes OUT of the SOW

Everything else. "Future phases." Verbal commitments made in meetings. Feature requests added after signing. Systems not explicitly named. Performance targets not explicitly stated. If it wasn't written and signed, it doesn't exist.

SOW — production template with FDE-specific clauses
legal template · adapt with counsel
STATEMENT OF WORK
Project: [Project Name]
Client: [Client Organization]
FDE: [Your Name / Company]
Date: [Date]    Version: 1.0

━━━ 1. OBJECTIVE ━━━
Enable [User Group] to [Action] by deploying a [Technology] integrated with [Named Systems].

━━━ 2. DELIVERABLES (the "fence") ━━━
D1: Data Pipeline
— Bronze ingestion from [named source systems]
— Silver cleaning (dedup, type casting, normalization)
— Gold aggregates powering the agent context
— dbt models with automated quality tests

D2: AI Agent
— Single-agent architecture using [ADK / LangChain / etc.]
— Grounded on [Gold BQ tables] via Vertex AI Search
— Tools: [list each tool — BQ query, GCS read, etc.]
— System prompt reviewed and approved by Client

D3: Deployment
— Cloud Run service in [GCP project]
— IAM with least-privilege service accounts
— Basic monitoring dashboard (Cloud Monitoring)

D4: Handover
— Runbook for Day 2 operations (written documentation)
— 2-hour training session with named Run Team
— 30-day hypercare period (5 hrs/week support)

━━━ 3. ACCEPTANCE CRITERIA (measurable success) ━━━
The project is complete when ALL of the following are met:
— Retrieval hit rate ≥ 90% on client-approved golden dataset (30 queries)
— End-to-end agent response latency < 8 seconds (p95)
— Zero hallucinations on golden dataset (Groundedness score ≥ 0.95)
— UAT sign-off from ≥ 3 of 5 named pilot users
— Run Team can independently restart the service per runbook (demonstrated)

━━━ 4. OUT OF SCOPE ← READ THIS SECTION FIRST IN EVERY DISPUTE ━━━
The following are explicitly NOT included in this engagement:
— Integration with [legacy AS400 / SAP / any unnamed system]
— Fine-tuning or training of foundation models
— Multi-agent or parallel agent architectures
— Mobile or native application development
— Data science / statistical modeling
— Any work related to [system] not listed in Section 2
Any request outside this scope requires a written Change Order signed by both parties before work begins.

━━━ 5. CLIENT RESPONSIBILITIES ━━━
Client agrees to provide, by the dates shown:
— GCP project with Editor access [Week 1, Day 1]
— Named internal Run Team owner [Week 2, Day 1]
— Golden dataset (30 labeled Q&A) [Week 3, Day 1]
— 5 pilot users for UAT [Week 5, Day 1]
— Firewall / VPN access for FDE [Week 1, Day 1]
BLOCKER CLAUSE: If Client fails to provide any item above within 5 business days of the listed date, the timeline extends by 1 day for each day of delay. No penalties apply.

━━━ 6. TIMELINE ━━━
Week 1: Site survey + data audit + access provisioning
Week 2: Bronze/Silver pipeline + GCP landing zone
Week 3: Gold layer + agent MVP (basic Q&A, no eval)
Week 4: Eval harness + iteration + first UAT session
Week 5: UAT sign-off + hardening + Run Team training
Week 6: Production deployment + hypercare begins
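The acceptance criteria gate on a retrieval hit rate over a golden dataset, so it is worth being precise about how that number is computed. A minimal sketch — the golden-set format and the `retrieve()` stub are illustrative assumptions, with a toy dictionary standing in for the real retriever (e.g. Vertex AI Search):

```python
# Hedged sketch of the hit-rate acceptance check: for each golden query,
# does the expected document appear in the retriever's top-k results?
# Golden-set schema and retrieve() are assumptions for illustration.

def hit_rate(golden, retrieve, k=5):
    """Fraction of golden queries whose expected doc appears in the top-k results."""
    hits = sum(
        1 for item in golden
        if item["expected_doc"] in retrieve(item["question"])[:k]
    )
    return hits / len(golden)

# Toy stand-in for the real retriever.
def fake_retrieve(question):
    index = {"q1": ["doc_a", "doc_b"], "q2": ["doc_c"], "q3": ["doc_x"]}
    return index.get(question, [])

golden = [
    {"question": "q1", "expected_doc": "doc_a"},
    {"question": "q2", "expected_doc": "doc_c"},
    {"question": "q3", "expected_doc": "doc_z"},  # a miss
]

rate = hit_rate(golden, fake_retrieve)  # 2 of 3 golden queries hit
print(f"retrieval hit rate: {rate:.0%}")
```

In practice this runs in CI against the client-approved 30-query set, and the build fails when the rate drops below the SOW's 90% threshold.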

The Minimum Viable Architecture

The MVA is the technical answer to "what's the simplest thing that proves value?" The FDE impulse is to design the full production-grade system immediately. The MVA discipline says: resist that impulse. A proof of value in 2 weeks beats a perfect architecture in 8.

MVA Design Rules

Swap in managed services for custom builds. Use Cloud Run instead of GKE. Use Vertex AI Search instead of custom RAG. Use CSV exports instead of real-time Pub/Sub. Complexity comes in Phase 2, after you've proven value.

Hard-code the things you'll later parameterize. One client, one dataset, one agent, one tool. Generalization is a Phase 2 problem.

30-day value window. If you can't show a demo in 30 days, the MVA is too ambitious. Cut scope, not quality.

MVA vs Full Architecture

MVA: CSV → Cloud Run (Python + DuckDB) → Vertex AI Search → Cloud Run Agent → Chat UI

Phase 2: Pub/Sub → Dataflow → BigQuery (partitioned/clustered) → GKE Agent Cluster → Internal portal

The MVA is 1 week to build. Phase 2 is 4 weeks. The client sees value in Week 2. The engagement continues. That's the ROI of the MVA pattern.
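Under the MVA rules, the Silver step of that CSV pipeline can be almost trivially small. A minimal sketch in plain Python — one client, one hard-coded schema, dedup plus type casting and nothing else (column names and data are invented; the real MVA would run DuckDB over the CSV):

```python
# Minimal MVA Silver step: type casting + keep-latest dedup per key.
# Schema and sample data are invented for illustration.

import csv
import io
from datetime import datetime

RAW = """order_id,amount,updated_at
1001,19.99,2024-03-01T10:00:00
1001,19.99,2024-03-02T09:30:00
1002,5.50,2024-03-01T11:15:00
"""

def silver_rows(raw_csv):
    """Cast types and keep the latest record per order_id (simple dedup)."""
    latest = {}
    for row in csv.DictReader(io.StringIO(raw_csv)):
        rec = {
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "updated_at": datetime.fromisoformat(row["updated_at"]),
        }
        prev = latest.get(rec["order_id"])
        if prev is None or rec["updated_at"] > prev["updated_at"]:
            latest[rec["order_id"]] = rec
    return sorted(latest.values(), key=lambda r: r["order_id"])

rows = silver_rows(RAW)  # order 1001 keeps only its latest copy
```

Generalizing the schema, sources, and dedup keys is exactly the Phase 2 work the MVA defers.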


Change Order — The Scope Creep Firewall

Change Order template
1-page template
CHANGE ORDER #[N]
SOW Reference: [SOW date / version]
Date: [Date]

REQUESTED CHANGE:
[One paragraph describing the new request exactly as the client stated it]

IMPACT ANALYSIS:
— Scope: [What new work is required, specifically]
— Timeline: [+N days/weeks to delivery]
— Cost: [$X additional or N additional hours at $Y/hr]
— Dependencies: [What client must provide for this to proceed]

TRADEOFF OPTIONS:
Option A: Add this scope — +[N] days, +[$X]
Option B: Defer to Phase 2 — no impact to current timeline
Option C: Replace [existing deliverable] with this — no timeline impact

DECISION REQUIRED BY: [Date — typically 2 business days]

Approved by Client: _______________ Date: ________
Approved by FDE:    _______________ Date: ________

Note: Work does not begin until both signatures are obtained. Verbal approval is not sufficient.

The Three Whys — Diagnostic Mindset

The FDE doc's version of "the Three Whys" is a structured diagnostic framework for uncovering the real business problem behind a client's stated request. A client rarely asks for what they actually need — they ask for a solution they've already imagined, based on an incomplete understanding of what's possible. The Three Whys excavate the actual problem.

The three questions are specific — not the generic "ask why five times" of lean manufacturing:

The Three FDE Whys

1. "What is the System of Record?" — Where does the ground truth actually live? If the answer is an Excel file on someone's desktop, the project is already fragile.

2. "What is the Cost of Inaction?" — If we don't build this, what happens? This defines the project's political priority and your negotiating leverage.

3. "What does Day 2 look like?" — Who maintains this when you leave? If there is no internal owner, the project will die after you leave — regardless of how good the code is.


Worked Example — "We want AI to help our analysts"

A head of analytics says: "We want to use AI to help our analysts work faster." This is a reasonable starting point. What does it actually mean?

1. What is the System of Record?
"Our analysts use a combination of Salesforce exports, internal SQL reports they run manually, and a shared Excel tracker on SharePoint."
Diagnosis: the System of Record is fragmented. Three sources with no single truth. Any AI built on top of this will produce inconsistent answers depending on which source it queries. The real work isn't building the agent — it's consolidating the sources first.

2. What is the Cost of Inaction?
"Each analyst spends about 2 hours per day pulling data from these systems and formatting it into reports. We have 12 analysts. That's 24 analyst-hours per day of manual work. At $80K average salary, that's roughly $240K/year in manual data work."
Diagnosis: this project has a ~$240K annual CoI — meaning you can justify $60K of engineering investment for a 4x ROI. This number also becomes your political weapon: when the IT department resists giving you access, you can say "this access delay is costing the organization roughly $5K per week."

3. What does Day 2 look like?
"Honestly, we haven't thought about that. We'd probably just hand it to the IT team?"
Diagnosis: no internal owner. This is the most common reason AI projects become shelfware. The answer "hand it to IT" means no one with domain knowledge will maintain it, prompt-tune it, or update the golden dataset. You need to negotiate an internal owner before Week 2 — ideally the most technical analyst on the team.
What the Three Whys produced

You started with "we want AI to help analysts." You now know:

1. The real problem is fragmented data, not missing AI. Fix the data architecture first.

2. The project has a ~$240K annual CoI — you can justify meaningful engineering investment and use the number to unblock political resistance.

3. The project needs an internal owner or it dies. Make this a Week 2 milestone in the SOW, not an afterthought.

None of this would have surfaced from a technical architecture discussion. It only surfaces from asking business-level questions.


The Discovery Checklist — Running the Three Whys at Scale

For a full engagement, the Three Whys expand into a pre-build discovery checklist. Never write a line of code before these are answered:

Pre-build discovery checklist
Week 1 · site survey
ADMINISTRATIVE + POLITICAL
[ ] Champion: Who is the internal person fighting for this project? If no one internally wants this, it will die regardless of quality.
[ ] Blocker: Which dept (IT, Legal, Compliance) is most likely to stop us? Meet them in Week 1. Don't discover their objection in Week 4.
[ ] Success metric: What number moves for this to be a success? "Better" is not a metric. "40% reduction in manual lookup time" is.
[ ] Executive sponsor: Who can unblock when the IT team delays our firewall request?

DATA + SYSTEMS OF RECORD
[ ] Named source systems: List every system whose data we'll touch. Generic ("our CRM") is not enough. Salesforce? HubSpot? Legacy Oracle?
[ ] Data classification: Is any data PII, PHI, or otherwise regulated? HIPAA / GDPR / SOC2 requirements change the architecture significantly.
[ ] Data volume: How many rows/GB/TB? Current and projected growth? 50GB = DuckDB. 50TB = BigQuery. The answer changes the stack.
[ ] Data quality: What are the known quality issues? Every client has them. Finding out in Week 3 is expensive.
[ ] Data latency: Is batch (daily) acceptable, or do we need streaming (<5 min)? Streaming adds 1 week and $X/month. Get the requirement in writing.

INFRASTRUCTURE
[ ] Cloud access: Do we have GCP Project Editor or Owner? Without this, nothing starts. Day 1 blocker.
[ ] Connectivity: Is this a private VPC? VPN? Air-gapped? Air-gapped = bring your own container registry and offline model weights.
[ ] Existing infra: What's already running in the project? Don't clobber their existing systems.
[ ] GPU quota: If deploying models, does the project have A100/H100 quota? Quota increases take 48-72 hours. Request on Day 1.

PEOPLE
[ ] Run Team owner: Who maintains this after we leave? Name. Title. Technical level. Must be identified by Week 2.
[ ] UAT users: Who are the 3-5 pilot users for acceptance testing? Must be real end users, not managers. Named by Week 3.
[ ] Stakeholder comms: How often does the exec sponsor want status updates? Weekly WES or biweekly? Slack or email? Right format = no surprises.

RED FLAGS — escalate immediately if you see these:
[ ] "Data will be ready in 2 weeks" → It never is. Build with what exists now.
[ ] "We don't need a PM on our side" → The project will lose direction.
[ ] "Can we just run this on-prem for now?" → Deep cloud distrust. Surface it early.
[ ] No named Run Team owner by Week 2 → The project will die when you leave.
[ ] "The CEO wants this by Friday" → Unrealistic deadlines produce bad systems.

Cost of Inaction — The Political Lever

How to calculate CoI

Time saved × headcount × fully-loaded cost = annual CoI

Example: 2 hrs/day saved per analyst is 25% of an 8-hour day. 12 analysts × $80K salary × 1.4 benefits multiplier = $1.34M fully-loaded payroll; 25% of that is ≈ $336K/year CoI, or ≈ $1,344 per working day (250 working days/year).

Add risk costs: "If our pipeline produces a bad report and a trader acts on it, the potential loss is $X." Now you have a ceiling for engineering investment.
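The CoI formula is simple enough to keep as a few lines you rerun with each client's numbers. A sketch using the worked example's figures — the 1.4 benefits multiplier, 8-hour day, and 250 working days are stated assumptions, not universal constants:

```python
# CoI calculator: time saved × headcount × fully-loaded hourly cost.
# The multiplier, day length, and working days are assumptions to adjust per client.

HOURS_SAVED_PER_DAY = 2          # per analyst
HEADCOUNT = 12
SALARY = 80_000
BENEFITS_MULTIPLIER = 1.4        # fully-loaded cost assumption
WORKING_DAYS = 250
HOURS_PER_DAY = 8

fully_loaded_hourly = SALARY * BENEFITS_MULTIPLIER / (WORKING_DAYS * HOURS_PER_DAY)
daily_coi = HOURS_SAVED_PER_DAY * HEADCOUNT * fully_loaded_hourly
annual_coi = daily_coi * WORKING_DAYS

print(f"${fully_loaded_hourly:.0f}/hr -> ${daily_coi:,.0f}/day -> ${annual_coi:,.0f}/yr")
```

Keeping the inputs explicit also makes the number defensible in the room: any stakeholder can challenge one assumption without the whole figure collapsing.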

Using CoI in conversations

"The IT team's 2-week delay on the firewall request is costing the organization approximately $13,000 in analyst time. I wanted to make sure you had that context before our next check-in."

Said to the right executive, this unlocks the firewall approval in 24 hours. Never say it confrontationally — frame it as "context," not accusation.

UAT — User Acceptance Testing

User Acceptance Testing is the moment of truth. The FDE doc states it plainly: if the users don't accept it, the project isn't done — regardless of how good the code is. UAT is not QA (your team testing your own work). It's not a demo (you driving the system for an audience). It's 3-5 real end users, in their actual work environment, completing real tasks without your help, and signing off that the system meets their needs.

Most engineering teams treat UAT as a formality — a box to check before deployment. FDEs treat it as the project's most important milestone, because failing UAT is the only thing that definitively stops you from getting paid.


The UAT Setup — What Most Engineers Get Wrong

UAT anti-patterns

You driving the demo. If you're clicking the buttons, you're doing a demo, not UAT. Seat the user in front of the keyboard and step back.

Managers as UAT users. Managers approve things. End users find problems. The analyst who will use this daily is your UAT user — not their director.

Perfect test data. If UAT uses clean, curated data you prepared, it will pass. The real data will fail. UAT must use real production data.

No success criteria defined in advance. If you haven't defined what "pass" looks like before UAT, the goalposts move during the session.

UAT done right

5 real users, real tasks, real data. No demo mode. No curated examples. The user tries to do their actual job with the new tool.

Pre-defined acceptance criteria. The SOW acceptance criteria are the UAT pass/fail criteria. UAT doesn't end until every criterion is met or explicitly waived.

Structured observation. You take notes. You do not explain, coach, or justify. Every point of confusion the user hits is a UX bug.

Written sign-off. Email or form, timestamped. "UAT passed" said verbally doesn't count. You need this for the contract milestone.


The UAT Scorecard

UAT scorecard — one per pilot user
template
UAT SCORECARD
Project: [Project Name]    Date: [Date]
User: [Name / Role]    FDE Observer: [Your Name]

━━━ TASK COMPLETION (from SOW acceptance criteria) ━━━
Task 1: Query agent for [specific use case]
  Result: [ ] PASS  [ ] FAIL  [ ] PARTIAL
  Notes: ______________________________
  Time to complete: __ min (target: <8 min)
Task 2: Retrieve report from [named dataset]
  Result: [ ] PASS  [ ] FAIL  [ ] PARTIAL
  Notes: ______________________________
Task 3: Interpret agent response for [business decision]
  Result: [ ] PASS  [ ] FAIL  [ ] PARTIAL
  Notes: ______________________________
Task 4: [Client-specific task from golden dataset]
  Result: [ ] PASS  [ ] FAIL  [ ] PARTIAL
  Notes: ______________________________

━━━ QUALITY DIMENSIONS (1-5 scale) ━━━
Response accuracy:    __/5  (Do answers match known correct answers?)
Response latency:     __/5  (Is the response time acceptable?)
Ease of use:          __/5  (Could user complete tasks without help?)
Trust in output:      __/5  (Would user act on the agent's answer?)
Overall satisfaction: __/5

━━━ VERBATIM FEEDBACK (write what they say, not your interpretation) ━━━
______________________________

━━━ BLOCKING ISSUES (must fix before sign-off) ━━━
______________________________

SIGN-OFF
[ ] I accept this system for production use as described in the SOW.
[ ] I accept with the following conditions: ______
[ ] I do not accept. Blocking issues listed above must be resolved first.
User signature: _______________ Date: ________

The Live UAT Tracking Table

For a 5-user UAT session, track pass/fail in real time. The table below shows what a mid-UAT status looks like — three users done, two pending, one blocking issue identified.

User                     Task 1   Task 2                   Task 3   Task 4   Avg Score   Sign-off
Ana R. (Senior Analyst)  PASS     PASS                     PASS     PASS     4.6 / 5     ✓ Signed
Marcus T. (Analyst II)   PASS     FAIL (wrong date range)  PASS     PASS     3.8 / 5     ✗ Blocking
Priya K. (Lead Analyst)  PASS     PASS                     PASS     PASS     4.9 / 5     ✓ Signed
James O. (Analyst I)     Scheduled: Tomorrow 2pm                                         Pending
Sofia M. (Ops Manager)   Scheduled: Tomorrow 3pm                                         Pending

The blocking issue — Marcus, Task 2

Marcus queried the agent for "Q1 revenue" and the agent returned Q4 of the prior year. Root cause: the system prompt gave the agent no grounding in the current date, so it resolved "Q1" against its training data rather than CURRENT_DATE. Fix: add `Today is {date}` to the system prompt and re-run the golden dataset eval. Estimated fix time: 2 hours. UAT resumes tomorrow on the original schedule.

This is the value of UAT. This bug would have been invisible in a demo (you would have queried with an explicit date range). It was found because a real user asked a natural language question the way they actually think about their work.
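The fix itself is a one-liner worth showing. A hedged sketch of date grounding — the prompt wording and `build_system_prompt` helper are illustrative, not a specific framework's API:

```python
# Hypothetical sketch: inject the current date into the system prompt at
# request time so relative dates ("Q1", "last month") resolve correctly.

from datetime import date

BASE_PROMPT = (
    "You are an analytics assistant. Answer using the Gold tables only. "
    "Today is {today}. Resolve relative dates like 'Q1' or 'last month' "
    "against today's date, never against your training data."
)

def build_system_prompt(today=None):
    """Format the prompt with today's date; inject a fixed date for tests."""
    today = today or date.today()
    return BASE_PROMPT.format(today=today.isoformat())

prompt = build_system_prompt(date(2024, 4, 2))
# With this prompt, "Q1 revenue" resolves to Q1 2024, not Q4 of the prior year.
```

Accepting an explicit `today` argument matters: it lets the golden-dataset eval pin the date, so the same test queries produce the same expected answers on every run.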


UAT → Day 2 Transition

UAT sign-off is not the end — it's the beginning of the Day 2 operations period. The FDE doc defines Day 2 as everything that happens after you leave: monitoring, retraining models as data drifts, training the internal Run Team, and handling incidents without you.

Day 2 handover checklist
Week 5-6
DOCUMENTATION (must exist before you leave)
[ ] Runbook: how to restart the agent service
[ ] Runbook: how to update the golden dataset for re-evaluation
[ ] Runbook: how to add a new data source to Bronze
[ ] Architecture diagram: current state (not aspirational)
[ ] Incident playbook: what to do when the dashboard is stale
[ ] Access inventory: who has access to what and via which SA

RUN TEAM TRAINING (must be demonstrated, not just documented)
[ ] Run Team owner can restart Cloud Run service from GCP Console
[ ] Run Team owner can run dbt manually for a backfill
[ ] Run Team owner can read Cloud Logging to diagnose an issue
[ ] Run Team owner knows who to contact (and how) if they're stuck
[ ] Run Team owner has received Slack/PagerDuty alert credentials

MONITORING (must be live before hypercare ends)
[ ] Cloud Monitoring dashboard for agent latency + error rate
[ ] dbt source freshness check running on schedule
[ ] Circuit breaker alerts → Slack channel (not just logs)
[ ] Weekly BQ cost report (INFORMATION_SCHEMA query on schedule)
[ ] Vertex AI prediction drift monitoring (if applicable)

HYPERCARE PERIOD (30 days post-launch)
[ ] 5 hours/week FDE availability agreed and in calendar
[ ] Weekly WES continuing through hypercare
[ ] Escalation path: how does client reach you for P1 incidents?
[ ] Hypercare end date: [Date]. After this date, support is via standard SLA, not hypercare.

THE OBSOLESCENCE TEST: "Could the Run Team handle the 5 most likely incidents without calling me?"
If yes → Day 2 is set up correctly. You've done your job.
If no → More training or documentation needed before hypercare ends.
The Obsolescence Principle

The FDE doc's final line: "The FDE's goal is to become obsolete at a client site — because the system you built is so good, it runs itself."

This is the counterintuitive success condition of consulting work. A consultant who makes themselves indispensable has failed. A system that requires the original FDE to operate every time is a liability to the client, not an asset. The evidence of a successful engagement is that the client team runs it confidently without you — and calls you back for the next engagement because of how well the first one went.