THE FDE PRIME DIRECTIVE
The Forward Deployment Engineer is the Technical Special Forces operator who bridges perfect engineering from HQ with the hostile, legacy, politically-charged reality of enterprise client sites.
You are not a consultant. You are not a pure SWE. You are an embedded operative who writes code in the morning, manages a CTO's anxiety in the afternoon, and architects a migration strategy before dinner.
| Role | Standard SWE | FDE (You) |
|---|---|---|
| Users | Millions of anonymous | High-stakes stakeholders |
| Environment | Controlled cloud | Hostile, air-gapped, hybrid |
| Goal | Scale + stability | Speed-to-value + survival |
| Code Ratio | 90% features | 50% glue + 50% strategy |
| Secret Weapon | Design patterns | Pattern recognition across decades |
THE GREYBEARD'S STACK
Skill depth earned through actual production scars, not tutorials:
LANGUAGE ARSENAL
WHAT IS THE DELTA?
PRODUCT REALITY
What the product does in a clean demo environment with perfectly formatted JSON, stable network, and compliant data.
CLIENT REALITY
Air-gapped servers, 30-year-old schemas, corrupted CSVs, political resistance from IT, and a budget that got cut 20% last week.
THE TECHNOLOGY SURVIVOR'S TIMELINE
You've outlived every "paradigm shift." Here's what that actually means for today.
THE POLYGLOT ADVANTAGE
30 years of language acquisition means you don't just write code — you think in multiple paradigms simultaneously. This is the FDE's cognitive superpower that no bootcamp grad can replicate.
IMPERATIVE BRAIN
C / Go / Rust
You think in memory, cycles, and system calls. When the LLM-built pipeline OOMs on the client's 64GB server at 3am, you know exactly where to look because you've debugged segfaults by hand.
FUNCTIONAL BRAIN
Haskell / Scala / Clojure philosophy
Immutability, pure functions, composition. When you architect a data pipeline, you naturally reach for the patterns that prevent state bugs before they can exist.
DECLARATIVE BRAIN
SQL / Terraform / HCL / YAML
The ability to describe what you want rather than how to get it. This is infrastructure as poetry. You understand why Terraform is powerful precisely because you remember configuring servers by hand.
DYNAMIC BRAIN
Python / JavaScript / Ruby
Speed of thought. Prototyping. Glue code. When a client needs a proof-of-concept in 2 hours, you don't argue about types — you ship something that works and proves the value.
SHELL/BASH BRAIN
Bash / awk / sed / grep
The language of the machine's own nervous system. One-liners that process 10GB log files. The ability to operate in a client environment where nothing is installed except the OS itself.
LEGACY BRAIN
COBOL / FORTRAN / PL/SQL
The greybeard's secret weapon. When the client's "Data Warehouse" is actually a COBOL batch job from 1987 that processes 4 trillion dollars a day, you are the only person in the room who can read it.
POLYGLOT PATTERN MAPPING
How the same architectural concept appears across the stack — knowing all of these makes you a force multiplier:
| Concept | Low Level | Systems | Data | AI Layer |
|---|---|---|---|---|
| Message Passing | Unix pipes | Kafka / Pub/Sub | Spark RDDs | Agent A2A Protocol |
| Immutability | const / read-only mem | Event Sourcing | Bronze Layer (Raw) | Training data provenance |
| Lazy Evaluation | Generator functions | Stream processing | Spark DAG execution | LLM token streaming |
| Backpressure | TCP flow control | Kafka consumer lag | Dataflow autoscaling | Rate-limited API calls |
| Memoization | CPU cache / L1-L3 | Redis / CDN | BigQuery cache | KV Cache in LLM inference |
| Garbage Collection | free() / RAII | K8s pod eviction | Table expiration policy | Context window management |
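The Lazy Evaluation row is the easiest to see in miniature. A minimal Python sketch (illustrative only): generators build a pipeline description, and nothing executes until a consumer pulls — the same idea behind Spark DAG execution and LLM token streaming.

```python
def read_records(lines):
    """Lazily normalize lines; no work happens until a consumer pulls."""
    for line in lines:
        yield line.strip().upper()

# Composing generators builds the pipeline without touching any data yet,
# exactly like building a Spark DAG before calling an action.
pipeline = (rec for rec in read_records(iter(["a ", "b", "c "])) if rec != "B")

result = list(pipeline)  # evaluation is forced here
```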
THE GREYBEARD'S HYPE FILTER
Pattern recognition from 30+ years of cycles. Every "revolutionary" technology follows the same arc:
THE HYPE CYCLE PATTERN
- Announced at a conference with live demo that works
- VC money floods in
- Every startup rebuilds in the new paradigm
- Production edge cases emerge
- The "boring" version that works ships
- It becomes the new legacy system
ETERNAL TRUTHS (NEVER CHANGE)
- Networking is hard — partial failure is the only failure mode
- Data is the moat — not the model, not the infra
- Humans are the bottleneck — always, forever
- Simple > Clever — the clever solution fails at 3am
- Observability first — if you can't see it, you can't fix it
- The database outlives everything — design schemas with respect
CURRENT HYPE → ACTUAL SIGNAL
- Hype: "LLMs replace all engineers" → Signal: LLMs automate boilerplate, amplify experts
- Hype: "Vector DBs solve RAG" → Signal: Hybrid search (BM25 + vectors) wins
- Hype: "Agents will be autonomous" → Signal: Human-in-loop wins for now
- Hype: "Serverless = no ops" → Signal: Observability becomes harder
THE FDE AI ARSENAL
Your battle-tested toolkit for deploying AI in hostile enterprise environments. Organized by mission objective.
AGENT ORCHESTRATION LAYER
Google ADK (Agent Development Kit)
Code-first, model-agnostic. Hierarchy of agents: Planner → Workers → Reviewer. Native A2A protocol. Deploys to Vertex AI Agent Engine.
[Production Ready · GCP Native]

LangGraph
State machines for agents. Cyclical graphs, human-in-loop checkpoints, persistent state. The greybeard loves this: it's just a graph traversal problem.
[Battle-Tested · Graph Model]

A2A (Agent2Agent) Protocol
Open HTTP standard for agent-to-agent communication. Discovery, delegation, status reporting. The REST API pattern for multi-agent systems.
[Open Standard]

MCP (Model Context Protocol)
Anthropic's standard for connecting LLMs to external tools and data sources. Think USB-C but for AI tool calling — universal, typed, composable.
[Emerging Standard · High Priority]

ENTERPRISE RAG BLUEPRINT

Document AI
Enterprise PDFs, scanned documents, tables-in-images. These are the grinders that turn messy real-world documents into clean structured data for RAG.
[Ingestion Layer]

Vertex AI Search
Fully managed semantic search over client data. Grounding LLMs against private enterprise knowledge without data leaving GCP. Your first call after data is in BigQuery.
[GCP Native · Managed]

Vertex AI Vector Search
ScaNN algorithm, petabyte-scale ANN search. When Pinecone's egress fees become a budget conversation, this is your answer within the GCP perimeter.
[GCP Native]

Hybrid Search (BM25 + Vectors)
Semantic + keyword. Dense retrieval misses exact product codes and industry jargon. BM25 catches "SKU-4X91-B" when vector search returns "that part number thingy."
[Critical Pattern]
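One common way to combine the keyword and vector rankings is Reciprocal Rank Fusion. This is a generic sketch of the technique, not any particular product's API; the document IDs are invented:

```python
def rrf_fuse(keyword_ranking, vector_ranking, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc IDs.

    Exact-match hits (the "SKU-4X91-B" case from BM25) survive even
    when the embedding model has never seen the token.
    """
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["sku-4x91-b", "doc7"], ["doc7", "doc2"])
```

Documents ranked by both retrievers float to the top; documents found by only one retriever still make the list instead of vanishing.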
THE RAG TRIAD — YOUR EVAL NORTH STAR
GROUNDEDNESS
Does the answer come only from retrieved context?
Hallucination score = 1 - groundedness.
RELEVANCE
Was the right context retrieved? Top-3 hit rate is your KPI before worrying about generation quality.
FAITHFULNESS
Does the generated answer faithfully represent what the context says — nothing added, nothing omitted?
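In production the triad is scored by an LLM judge (RAGAS-style). The direction of the groundedness metric can be illustrated with a crude lexical proxy; this toy function is not a substitute for a real judge:

```python
def groundedness_proxy(answer, context):
    """Toy proxy: fraction of answer tokens that appear in the retrieved
    context. 1.0 = fully grounded. Real systems use an LLM judge; this
    only illustrates what the score measures."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 1.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)

score = groundedness_proxy("paris is the capital",
                           "the capital of france is paris")
hallucination = 1 - score  # hallucination score = 1 - groundedness
```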
LLM SYSTEMS EVALUATION — INNER + OUTER LOOP
INNER LOOP (DEV TIME)
Tool: ADK eval CLI + Web UI
When: During agent development
- tool_trajectory_avg_score — right tools used?
- response_match_score — ROUGE similarity
- rubric_based_final_response_quality
OUTER LOOP (PRODUCTION)
Tool: Vertex AI Gen AI Evaluation Service
When: Before any model/prompt update ships
- Pairwise (AutoSxS): Model A vs Model B via LLM judge
- Pointwise: Groundedness, fulfillment, coherence
- Pipeline Eval: Async batch for 10k+ test cases
PRODUCTION OBSERVABILITY
Latency Tracing: LangSmith or Cloud Trace for agent chain-of-thought visualization
Cost Monitoring: Token usage per query × daily query volume = your GCP bill. Set alerts at 80% budget.
- Prometheus + Grafana — request/error rates
- Loki — log aggregation
- Cloud Trace — distributed tracing
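The cost arithmetic is worth wiring into an alert. A minimal estimator; the per-million-token prices here are placeholders, substitute the current rate card for your model:

```python
def monthly_llm_cost(tokens_in, tokens_out, queries_per_day,
                     price_in_per_m=1.25, price_out_per_m=5.00):
    """Estimate monthly LLM spend. Prices are PLACEHOLDERS per 1M
    tokens, not real rates for any specific model."""
    per_query = (tokens_in / 1e6) * price_in_per_m \
              + (tokens_out / 1e6) * price_out_per_m
    return per_query * queries_per_day * 30

cost = monthly_llm_cost(tokens_in=4_000, tokens_out=500,
                        queries_per_day=10_000)
budget = 5_000
alert = cost >= 0.8 * budget  # page at 80% of budget, per the rule above
```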
ADVANCED AI DEPLOYMENT PATTERNS
TWO-TIER INFERENCE
For real-time constraints (banking fraud, trading)
Fast deterministic model (<100ms) for the primary decision path. LLM async deep-dive for analyst explanation. Never put an LLM in the hot path of a latency-critical system.
User request
└→ XGBoost / Rule Engine → Decision (50ms)
└→ Gemini/Claude → Explanation (async, 3-5s)
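The flow above can be sketched in a few lines. The risk model and the LLM call are stubs standing in for XGBoost and Gemini/Claude; the point is that the decision returns from the hot path while the narrative is computed off it:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fast_risk_score(txn):
    """Hot path: deterministic rules / XGBoost stand-in, milliseconds."""
    return 0.95 if txn["amount"] > 10_000 and txn["new_country"] else 0.10

def llm_explanation(txn):
    """Cold path: a Gemini/Claude call in production (stubbed here)."""
    time.sleep(0.05)  # stands in for 3-5s of model latency
    return f"Flagged: ${txn['amount']} transfer from a new country."

pool = ThreadPoolExecutor(max_workers=4)

def handle(txn):
    score = fast_risk_score(txn)                # decide on the hot path
    decision = "BLOCK" if score > 0.9 else "ALLOW"
    explanation = pool.submit(llm_explanation, txn)  # narrative is async
    return decision, explanation

decision, explanation = handle({"amount": 25_000, "new_country": True})
```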
SPECULATIVE DECODING
For inference cost reduction
Small draft model generates candidate tokens. Large verifier model accepts/rejects in parallel. 2-4x throughput gain with identical output quality. Critical for high-volume enterprise deployments.
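A toy sketch of one speculative step, with both "models" reduced to next-token functions. It shows the key invariant: the accepted output is exactly what the verifier alone would have produced:

```python
def speculative_step(draft, verify, prefix, k=4):
    """One speculative-decoding step (toy illustration)."""
    # 1. Cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)
    # 2. Expensive verifier checks them (one batched pass in practice)
    #    and keeps the agreeing prefix; on a mismatch the verifier's
    #    token wins, so output matches the big model running alone.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = verify(ctx)
        if token == expected:
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(expected)
            break
    return accepted

draft = lambda ctx: "the"                       # tiny model: always "the"
verify = lambda ctx: "the" if len(ctx) < 3 else "end"
tokens = speculative_step(draft, verify, prefix=["start"])
```

When the draft agrees often (typical for boilerplate-heavy enterprise text), most tokens cost only the cheap model plus one shared verification pass.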
MIXTURE OF EXPERTS (MOE) ROUTING
For multi-domain enterprise clients
Route queries to specialized agents/models by domain: Legal → legal-finetuned model. Finance → finance-finetuned model. General → frontier model. Cost-efficient + domain-accurate.
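The routing layer itself is often trivial; a keyword-rule sketch with illustrative model names (real deployments usually swap the classifier for a small model):

```python
ROUTES = {
    "legal": "legal-finetuned-model",      # model names are illustrative
    "finance": "finance-finetuned-model",
    "general": "frontier-model",
}

def route(query, classify):
    """classify: cheap domain classifier (rules or a small model)."""
    domain = classify(query)
    return ROUTES.get(domain, ROUTES["general"])

def keyword_classify(q):
    q = q.lower()
    if any(w in q for w in ("contract", "clause", "liability")):
        return "legal"
    if any(w in q for w in ("revenue", "ebitda", "invoice")):
        return "finance"
    return "general"

model = route("Summarize the liability clause", keyword_classify)
```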
HUMAN-IN-LOOP CHECKPOINTS
For regulated industries
Interrupt an agent pipeline at defined decision gates requiring human approval. Critical for healthcare (HIPAA), finance (SOX), and defense. LangGraph's interrupt/resume makes this elegant.
CONSTITUTIONAL AI GUARDRAILS
For enterprise content safety
Apply a second LLM pass to evaluate and filter output before it reaches end users. Define client-specific "constitutions": what is acceptable response content for this industry.
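The second-pass shape, sketched generically. In production `judge` is an LLM call per rule (or one batched call); here it is a stub, and the constitution entries are invented examples:

```python
CONSTITUTION = [
    "No medical dosage recommendations",   # example client-specific rules
    "No speculation about named individuals",
]

def apply_guardrail(draft_answer, judge):
    """Second-pass filter: judge(rule, text) -> True means `text`
    violates `rule`. Any callable works; an LLM in production."""
    violations = [rule for rule in CONSTITUTION if judge(rule, draft_answer)]
    if violations:
        return "I can't help with that request.", violations
    return draft_answer, []

def toy_judge(rule, text):
    # Stand-in for the judging model: flags dosage-style answers only.
    return "dosage" in rule.lower() and "mg" in text.lower()

safe, hits = apply_guardrail("Take 500 mg twice daily.", toy_judge)
```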
TACTICAL EDGE DEPLOYMENT
For air-gapped / defense clients
Quantized models (4-bit GGUF or ONNX). Local Ollama/llama.cpp runtime. Offline vector store. No external API calls. The model weights live on the device. Deploy like a software package, not a service.
DATA + CLOUD ARCHITECTURE WAR ROOM
MEDALLION ARCHITECTURE — BATTLE-TESTED
🥉 BRONZE LAYER
Raw Landing Zone
Immutable. Never transformed. Exactly as received from the source. This is your insurance policy — when the Silver layer breaks, Bronze is your truth.
- GCS bucket with versioning enabled
- Append-only — no deletes, no updates
- Partition by ingestion date, not event date
- Preserve original file format (CSV, JSON, Parquet)
🥈 SILVER LAYER
Single Source of Truth
Cleaned, joined, deduplicated. This is where dbt transformations live. The "Single Source of Truth" that all downstream consumers read.
- BigQuery tables with enforced schemas
- dbt models for transformation lineage
- Data quality checks (Great Expectations / dbt tests)
- SCD Type 2 for slowly changing dimensions
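SCD Type 2 from the list above, reduced to plain Python so the mechanics are visible. In practice this is a dbt snapshot or a BigQuery MERGE; the row shape here is illustrative:

```python
import datetime as dt

OPEN_END = dt.date(9999, 12, 31)  # sentinel for "current version"

def scd2_upsert(history, key, attrs, today):
    """Close the current row for `key` if attrs changed; open a new one."""
    current = next((r for r in history
                    if r["key"] == key and r["valid_to"] == OPEN_END), None)
    if current and current["attrs"] == attrs:
        return history                    # no change: no new version
    if current:
        current["valid_to"] = today       # expire the old version
    history.append({"key": key, "attrs": attrs,
                    "valid_from": today, "valid_to": OPEN_END})
    return history

rows = []
scd2_upsert(rows, "cust-1", {"tier": "silver"}, dt.date(2024, 1, 1))
scd2_upsert(rows, "cust-1", {"tier": "gold"}, dt.date(2024, 6, 1))
```

Point-in-time queries then become a filter: the row where `valid_from <= date < valid_to`.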
🥇 GOLD LAYER
Business-Ready / AI-Ready
Pre-aggregated, denormalized, optimized for the end consumer. Powers dashboards, APIs, and AI agents. OBT (One Big Table) patterns live here.
- BigQuery clustered + partitioned for query cost
- Feature Store for ML model inputs
- Vector embeddings for RAG retrieval
- Materialized views for dashboard performance
GCP DEPLOYMENT PATTERNS
INTERNET ──→ Cloud Armor (DDoS/WAF)
↓
Identity-Aware Proxy (Zero-Trust)
↓
┌───────────────────────────────────┐
│ SHARED VPC (Hub-Spoke Model) │
│ ┌─────────┐ ┌──────────────┐ │
│ │ GKE │ │ Cloud Run │ │
│ │ Private │ │ Serverless │ │
│ │ Cluster │ │ VPC Connector│ │
│ └─────────┘ └──────────────┘ │
│ ↓ ↓ │
│ ┌─────────────────────────────┐ │
│ │ VPC Service Controls │ │
│ │ (Data Exfil Prevention) │ │
│ └─────────────────────────────┘ │
│ ↓ ↓ │
│ BigQuery Vertex AI │
└───────────────────────────────────┘
↓
Cloud Interconnect → On-Prem Data Center
Client System (On-Prem)
↓ Cloud Interconnect
GCS (Bronze) → Dataflow/dbt → BigQuery (Silver/Gold)
↓
Vertex AI Search (Semantic Index)
↓
ADK Agent Engine
├── Planner Agent (Gemini Pro)
│ ↓ delegates to
├── SQL Coder Agent → BigQuery
├── Doc Researcher Agent → Vertex AI Search
└── Reviewer Agent → validate + format
↓
Cloud Run (API Gateway)
↓
End User / Dashboard
Source Systems (Kafka / REST / Webhook)
↓
Cloud Pub/Sub (Message Bus)
↓
Dataflow (Apache Beam)
├── Schema validation
├── PII masking (DLP)
└── Windowed aggregations
↓
BigQuery (Streaming Insert)
↓
Looker / Dashboard (near real-time)
+
ADK Agent (triggered by Pub/Sub on anomaly)
Hospital EMR System
↓ (encrypted in transit, TLS 1.3)
Cloud Healthcare API (FHIR / HL7 parser)
↓
DLP (Sensitive Data Protection) → PII masked
↓
BigQuery (CMEK encrypted, US-only region)
│
└── VPC Service Controls perimeter
↓
Vertex AI (no internet egress, private endpoints)
↓
Cloud Run (internal only, no public IP)
↓
Clinical Dashboard (IAP protected)
BIGQUERY PERFORMANCE TUNING — THE GREYBEARD'S CHEAT SHEET
PARTITIONING vs CLUSTERING
| Strategy | When to Use |
|---|---|
| Partition by DATE | Time-series queries (most enterprise data) |
| Cluster by column | High-cardinality filter columns (user_id, region) |
| Both | Default: partition date + cluster 2-3 columns |
| Nested/RECORD | Avoid JOINs — denormalize before querying |
QUERY COST KILLERS
- SELECT * on wide tables — always project columns
- No partition filter — always filter on partition column
- CROSS JOINs — usually a data model design failure
- REGEXP on large tables — pre-extract to Silver layer
- Window functions without PARTITION BY — full table scan
TERRAFORM IaC — DEPLOY IN <5 MINUTES
# The FDE's minimum viable GCP environment
module "fde_landing_zone" {
  source      = "./modules/fde-base"
  project_id  = var.client_project_id
  region      = "us-central1"
  environment = "prod"

  # BigQuery
  bq_datasets = ["bronze", "silver", "gold", "ml_features"]

  # GKE Private Cluster (Workload Identity enabled)
  gke_config = {
    autopilot     = true  # FDE default: less ops overhead
    private_nodes = true  # No public IPs
    min_nodes     = 3
    max_nodes     = 50
  }

  # VPC Service Controls
  vpc_sc_enabled   = true
  allowed_networks = [module.shared_vpc.network_id]

  # IAM (Least Privilege)
  fde_service_account_roles = [
    "roles/bigquery.dataEditor",
    "roles/aiplatform.user",
    "roles/storage.objectAdmin"
  ]
}
THE FDE CONSULTING PLAYBOOK
THE THREE WHYS + THREE REALITIES
Before writing a line of code, interrogate the situation:
WHERE IS THE SYSTEM OF RECORD?
The ground truth for each data domain. If it's an Excel file on someone's desktop, that project is already at risk. If it's SAP, prepare for a 6-month data migration. Identifying the SoR in Week 1 saves months of debugging stale/duplicate data later.

WHAT IS THE COST OF INACTION?
If we don't build this, what is the daily/monthly cost to the business? This number is your project's survival mechanism. When budgets get cut, the project with the highest CoI survives.

WHO OWNS DAY 2?
Who owns this system the day after the FDE leaves? If there is no named internal owner with the skills to maintain it, the system will degrade and the client will blame your product. Build for the handoff from Day 1.
THE TRUSTED ADVISOR FORMULA
Trust = (Credibility + Reliability + Intimacy) / Self-Orientation
| Variable | What It Means for FDEs |
|---|---|
| Credibility | You know what you're talking about. The greybeard has this automatically — but you must demonstrate it within the first meeting. |
| Reliability | You do what you say. Small promises kept consistently > big promises broken once. |
| Intimacy | You understand the client's actual fear (usually: job security, political exposure). Build 1:1 relationships before the architecture review. |
| Self-Orientation | How much are you focused on your own agenda (sell more licenses, look smart) vs. the client's win? This is the DENOMINATOR. Maximize the client's win. |
McKINSEY-GRADE FRAMEWORKS FOR TECHNICAL CHAOS
PYRAMID PRINCIPLE (BLUF)
Bottom Line Up Front
The greybeard pattern: executives want the conclusion, then the data that supports it. Not the journey. Not the technical details.
WRONG: "We analyzed the data and then
ran the pipeline and found some
issues and eventually we think..."
RIGHT: "The migration will be 2 weeks late.
Reason: data quality issues in source.
Fix: 3-day remediation sprint starting Monday."
MECE PRINCIPLE
Mutually Exclusive, Collectively Exhaustive
Break any complex problem into components that: (1) don't overlap and (2) together cover everything. No gaps, no double-counting.
Applied: When scoping a client project, your workstreams should be MECE. "Data Migration" and "ETL Pipeline" are not MECE — migration is part of ETL. Restructure until clean.
80/20 VALUE SCOPING
The 20% of features that deliver 80% of client value. Find these in Week 1. Build these first. Prove value. The remaining 80% of the feature list is negotiable.
THE "FIVE WHYS" DIAGNOSTIC
Never accept the stated problem as the real problem:
"Our AI model is inaccurate"
→ Why? Training data is stale
→ Why? No automated retraining
→ Why? No MLOps pipeline
→ Why? No ML engineer on staff
→ Why? No ML hiring budget
ROOT CAUSE: Budget prioritization problem, not a model accuracy problem.
🚩 RED FLAGS — ESCALATE IMMEDIATELY
DATA RED FLAGS
- "Data will be ready in 2 weeks" — add 6 weeks to your estimate
- "We have clean data" — no one has clean data
- "The schema is documented somewhere" — it isn't
- Multiple teams own the same data — political landmine
POLITICAL RED FLAGS
- "We don't need a PM on our side" — project will lose direction
- "The CTO approved this but we haven't told IT" — incoming resistance
- "Can we skip the security review?" — this will come back
- "The previous vendor failed too" — investigate WHY before proceeding
INFRASTRUCTURE RED FLAGS
- "Can we run this on-prem for now?" — deep cloud distrust, investigate root cause
- "We don't have GPU quota" — request lead time: 2-4 weeks minimum
- "Our firewall policy is managed by a different team" — add 2-3 weeks to any connectivity work
BATTLE-TESTED DOCUMENT TEMPLATES
### SITE SURVEY: [CLIENT] - [PROJECT]
Date: YYYY-MM-DD | Lead FDE: [Name]

## 1. DATA LANDSCAPE
- Source Systems: [SQL Server, SAP, SharePoint, etc.]
- Data Volume: [Total + growth rate per day]
- Quality Issues: [Missing keys, nulls, encoding issues]
- System of Record: [Per domain]

## 2. SECURITY & COMPLIANCE
- Data Classification: [PII / PHI / Confidential / Public]
- Identity Provider: [Okta / Azure AD / Google]
- Connectivity: [Public / VPN / Interconnect / Air-gap]
- Compliance Frameworks: [HIPAA / SOC2 / FedRAMP / PCI]

## 3. THE DELTA
- What product does out-of-box: [...]
- What client needs it to do: [...]
- Proposed glue code: [Custom parser / integration / adapter]

## 4. QUICK WIN (Week 2 Objective)
- [Stand up X on Y dataset to prove Z metric]

## 5. RISK REGISTER
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
### TECHNICAL PRD: [FEATURE NAME]
Version: 1.0 | Status: DRAFT | Owner: [FDE Name]

## OBJECTIVE
Enable [User Group] to [Action] using [Technology].

## SUCCESS CRITERIA (MEASURABLE)
- Retrieval: >90% Hit Rate, Top-3 documents
- Latency: End-to-end < 5 seconds P95
- Groundedness: 0% hallucination on Golden Dataset
- Uptime: 99.5% availability SLA

## ARCHITECTURE
[Mermaid diagram or ASCII flow]

## PHASED DELIVERY
Phase 1 (MVP, Day 30): [Manual trigger, static data]
Phase 2 (Scale, Day 60): [Automated, real-time]
Phase 3 (Optimize, Day 90): [Cost + performance tuning]

## EXPLICITLY OUT OF SCOPE
- [Legacy system X integration] - deferred to Q3
- [Feature Y] - not in SOW
## WEEKLY EXECUTIVE SUMMARY: [PROJECT]
Period: [Date Range] | Status: 🟢 GREEN / 🟡 YELLOW / 🔴 RED

## VALUE DELIVERED
- [Metric]: Reduced [task] time by [X]% via [solution]
- [Milestone]: [What was completed and why it matters]
- [Data]: [X] records processed, [Y] cost saved

## RISKS & BLOCKERS (Be Specific)
- Risk: [What is at risk]
- Impact: [Quantified impact if unresolved]
- Action Required: [Name] must [do X] by [Date]

## NEXT 30 DAYS
- [Milestone 1]: Complete [X]
- [Milestone 2]: Demo [Y] to [stakeholder]
- [Handoff]: Training [internal team] on [component]
THE FDE INTERVIEW BLACKBOOK
THE C.A.S.E. FRAMEWORK — NEVER START WITH CODE
C — CLARIFY
Before architecting anything, extract the constraints:
- What is the data volume? (GB, TB, PB?)
- What is the security classification? (PII, PHI, Secret?)
- What is the latency requirement? (<100ms? <5s? Next-day batch?)
- What is the "Definition of Done"?
- What is the client's cloud maturity? (GCP native? Hybrid? Zero cloud?)
A — ARCHITECT
Draw the data flow from source to end-user UI:
- Source → Ingestion → Bronze → Silver → Gold
- Name the GCP service at each step
- Identify where security controls live
- Estimate cost at each layer
S — SOLVE THE DELTA
What does the product NOT do out of the box?
- Custom format parser needed?
- Legacy system API wrapper?
- Data quality circuit breaker?
- Real-time → batch bridge?
This is where you demonstrate actual FDE instinct — the ability to see the gap and immediately propose the glue code.
E — EVALUATE
How do you prove the AI is working?
- What is the hallucination detection strategy?
- What is the Golden Dataset and who owns it?
- What monitoring exists in production?
- How does the system degrade gracefully?
HIGH-FREQUENCY INTERVIEW SCENARIOS
"A hospital chain wants to use AI to predict patient readmission. 20 years of data in on-prem SQL Server. Zero cloud presence. Extreme HIPAA concerns. Walk me through Day 1-30."
- Days 1-7 (Trust + Discovery): Data audit on SQL Server. Map schema to FHIR standard. Meet the CMO to define "readmission" (30d? 90d?). Build trust with IT by showing you understand their HIPAA liability.
- Days 8-15 (Secure Landing Zone): Cloud Healthcare API + DLP masking. BigQuery (CMEK, US-region). VPC Service Controls perimeter. BAA with Google. No raw PHI touches any model.
- Days 16-25 (The Pipeline): Vertex AI Search grounded in patient history. Custom Cloud Run service for real-time vitals from SQL Server (the Delta). Two-tier prediction: XGBoost for fast risk score + Gemini for clinical narrative.
- Days 26-30 (Proof): AutoSxS vs historical outcomes. UAT with 5 doctors. If behavior doesn't change, project has failed.
"Client has 5PB of data on-prem. Emergency exercise requires it in BigQuery in 48 hours. How?"
"A bank wants real-time fraud detection under 100ms using an LLM. How do you architect this?"
"The client's Lead Engineer hates your product and refuses to give VPC access. What do you do?"
"Deploy an LLM-powered intelligence analysis tool on a classified network with zero internet access."
SENIOR vs JUNIOR ANSWER RUBRIC
| Dimension | Junior Answer | Senior FDE Answer |
|---|---|---|
| Security | Not mentioned | First thing discussed, specific controls named |
| Cost | Not mentioned | BigQuery slot cost, API token cost, egress fees estimated |
| Stakeholders | Not mentioned | Named the "Champion" and "Blocker" in Week 1 plan |
| Day 2 | Not mentioned | Named the internal owner and training plan |
| Evaluation | "Test it manually" | Golden dataset, specific metrics, AutoSxS plan |
| The Delta | Uses product out-of-box | Immediately identifies what custom glue code is needed |
| Failure modes | Not considered | Mentions circuit breakers, fallbacks, degraded modes |
FORWARD DEPLOYMENT CHECKLIST
Click items to mark complete. Your battle-tested pre-flight sequence.
WEEK 1 — RECON & TRUST
- Identify the internal "Champion" who owns project success
- Identify the "Blocker" department (IT, Legal, or Politics)
- Define Success Metric — measurable, agreed by executive sponsor
- Complete data audit — source systems, volume, quality issues
- Confirm System of Record for each data domain
- Identify data classification (PII, PHI, Confidential)
- Map compliance requirements (HIPAA, SOC2, FedRAMP, GDPR)
- Understand connectivity constraints (VPN, Interconnect, air-gap)
- Confirm GCP project access level (Editor? Owner?)
- Check GPU quota availability for Vertex AI
- Deliver Site Survey document to stakeholders
WEEKS 2-3 — RAPID BUILD
- Terraform landing zone deployed (GCS, BigQuery, GKE)
- VPC Service Controls perimeter active
- IAM — least privilege service accounts configured
- DLP masking pipeline operational for PII/PHI
- Bronze layer ingestion running and validated
- Silver layer transformations (dbt models) tested
- Gold layer views/tables for AI consumption ready
- Vertex AI Search index populated and searchable
- ADK agent prototype functional on test data
- Golden Dataset (50+ Q&A pairs) created with client
- Observability: Cloud Trace + Logging + Alerting configured
WEEK 4 — PROVE VALUE
- Eval suite passing: groundedness >95%, retrieval >90%
- Security review completed and signed off
- Load test completed (define expected concurrent users)
- Cost estimate validated — actual vs budgeted
- UAT: 5 real end-users have used the system
- User behavior has measurably changed (the actual KPI)
- Internal "Run Team" named and training scheduled
- Runbook documented: how to debug, restart, scale
- Model Monitoring alerts configured for drift
- Executive Status Report delivered with metrics
- SOW Phase 2 / renewal discussion initiated
DATA QUALITY CIRCUIT BREAKERS
The FDE's production insurance policy. Implement these before going live:
# dbt test examples — run in Silver layer before Gold promotion
- Freshness: data older than 24h triggers WARNING, older than 48h triggers ERROR
- Not Null: critical columns must be non-null (fail pipeline if >0.1% null)
- Uniqueness: primary keys must be unique (fail pipeline on any duplicate)
- Referential Integrity: all foreign keys must exist in parent table
- Range Checks: numeric columns within expected business range
- Volume: row count within ±15% of 7-day moving average
- Schema: column names and types must match expected schema

# If any CRITICAL check fails:
# 1. Alert via PagerDuty / Slack
# 2. STOP pipeline — do NOT promote bad data to Gold
# 3. Keep Bronze intact for forensics
# 4. Auto-open incident ticket with failing check detail
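The volume check is the one most often skipped and most often regretted. A minimal sketch of the ±15% moving-average breaker (the gating logic lives in your orchestrator in practice):

```python
def volume_check(today_rows, last_7_days, tolerance=0.15):
    """Circuit breaker: block Gold promotion when today's row count
    drifts more than the tolerance from the 7-day moving average."""
    baseline = sum(last_7_days) / len(last_7_days)
    drift = abs(today_rows - baseline) / baseline
    return drift <= tolerance

ok = volume_check(1_000_000, [980_000] * 7)    # ~2% drift: promote
halted = volume_check(400_000, [980_000] * 7)  # ~59% drop: halt pipeline
```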
THE COMPLETE FDE GLOSSARY
Every term you need to own the room — from boardroom to war room.
FOUNDATIONAL CONCEPTS
THE DELTA
The gap between what a product does out-of-the-box and what is required to make it solve a client's specific mission. The FDE's entire job is to close the Delta through custom engineering.

THE LAST MILE
The complex final work of connecting a modern SaaS/AI platform to legacy enterprise systems. Usually 20% of the work but 80% of the project timeline. The greybeard excels here.

SHADOW IT
Unauthorized tools/databases maintained by individual employees or teams outside the official IT stack. Paradox: this is often where the cleanest and most current data lives.

SYSTEM OF RECORD (SoR)
The authoritative, canonical data source for a given business entity (e.g., SAP = finance SoR, Salesforce = CRM SoR). Never build on data replicas when you can trace to the SoR.

PRODUCTIZATION
Solving a client's unique problem through code that can eventually be abstracted into a reusable product feature. The FDE's work should always have this secondary ambition.
TECHNICAL TERMS
AIR-GAPPED ENVIRONMENT
Environments with zero or intermittent internet connectivity (Defense, Energy, classified). Requires local container registries, offline model weights, and self-contained deployment packages.

VPC SERVICE CONTROLS
GCP security perimeter preventing data exfiltration from managed services (BigQuery, Vertex AI) to unauthorized projects. Mandatory for Finance/Gov/Healthcare deployments.

WORKLOAD IDENTITY
GCP's gold-standard security pattern — allowing GKE pods/services to act as IAM service accounts without managing JSON key files. If you're using key files in GKE, you're doing it wrong.

SCD TYPE 2
Slowly Changing Dimension Type 2 — tracking historical changes in dimension tables by creating new rows with effective/expiry dates. Essential for any client needing point-in-time analytics.

DATA SKEW
When data is unevenly distributed across partitions, causing some workers to process 10x more data than others. Fix: salting keys, repartitioning, or using AQE (Adaptive Query Execution).
AI / AGENT TERMS
GROUNDING
Connecting an LLM to verified, authoritative data sources (via RAG or Search) so its responses are factual and cite-able. Grounding is the primary defense against hallucination in enterprise deployments.

HALLUCINATION
When an LLM generates confident-sounding but factually incorrect output. In enterprise contexts this is not a curiosity — it is a business liability. Grounding + RAGAS faithfulness scores are your defenses.

KV CACHE
Key-Value cache of attention computations in transformer inference — the LLM equivalent of CPU L1/L2 cache. Understanding this is how you optimize LLM inference latency and cost at scale.

ReAct
Reasoning + Acting — the core loop of an LLM agent: (1) Think about what to do, (2) Act using a tool, (3) Observe the result, (4) Repeat until task complete. The basis of all ADK agent behavior.

AutoSxS (LLM-AS-JUDGE)
Using a superior LLM as an "autorater" to compare two model responses. Provides win rates and structured justifications. The gold standard for proving a prompt/model change is an improvement.
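The ReAct loop defined above fits in a dozen lines once the model and tools are reduced to callables. Everything here is a stub; real frameworks add prompt formatting, parsing, and error handling around exactly this skeleton:

```python
def react_loop(llm, tools, task, max_steps=5):
    """Minimal ReAct loop: Think -> Act -> Observe until done.
    `llm` returns (thought, action, argument); `tools` maps action
    names to callables. Both are stubs standing in for real calls."""
    scratchpad = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = llm(scratchpad)        # (1) think
        scratchpad.append(f"Thought: {thought}")
        if action == "finish":
            return arg                                # (4) done
        observation = tools[action](arg)              # (2) act
        scratchpad.append(f"Observation: {observation}")  # (3) observe
    return None

def stub_llm(scratchpad):
    if not any(s.startswith("Observation") for s in scratchpad):
        return "Need the row count first", "query_db", "SELECT COUNT(*) FROM orders"
    return "I have the answer", "finish", "42 rows"

answer = react_loop(stub_llm, {"query_db": lambda q: "42"}, "How many orders?")
```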
CONSULTING TERMS
STATEMENT OF WORK (SOW)
The legally-binding fence around your project scope. Every feature not in the SOW is "scope creep" and must go through a change order. The FDE's primary defense against an impossible workload.

COST OF INACTION (CoI)
The quantified cost of NOT deploying the solution. Used to prioritize projects and justify budgets. "Every day we don't have this system costs us $X in analyst labor."

UAT (USER ACCEPTANCE TESTING)
The moment of truth where actual end-users validate the system. If users don't change their behavior based on the output, the project has not succeeded — regardless of technical correctness.

DAY 2
Everything that happens after the FDE leaves: monitoring, model retraining, schema migrations, incident response, user training. Design for Day 2 from Day 1 — otherwise you'll be back on emergency support in 6 months.

MECE
Mutually Exclusive, Collectively Exhaustive. A McKinsey framework for structuring problems so that all components are distinct (no overlap) and together cover the entire problem space. Apply to project workstream planning.
THE FDE READING LIST — GREYBEARD EDITION
THE CANON (Non-Negotiable)
Designing Data-Intensive Applications (Martin Kleppmann)
The single most important book for an FDE. Explains why every database, streaming system, and distributed architecture works the way it does. The greybeard reads this and confirms what they already experienced by hand.
[Essential · Timeless]

Enterprise Integration Patterns (Hohpe & Woolf)
Every integration pattern you'll encounter in enterprise glue work: Message Bus, Dead Letter Queue, Canonical Data Model, Idempotent Receiver. If Kafka/Pub/Sub frustrates you, read this first.
[Essential · Last-Mile Integration]

The Trusted Advisor (Maister, Green & Galford)
FDEs fail more often due to broken trust than broken code. This book operationalizes trust as a formula and gives you concrete practices for moving from "vendor" to "strategic partner."
[Consulting · Essential]

Good Strategy / Bad Strategy (Richard Rumelt)
Teaches you to identify the "crux" of a client's actual problem vs. their stated problem. Bad strategy is a list of goals. Good strategy is a diagnosis + guiding policy + coherent actions. Every FDE needs this lens.
[Strategy]

THE PAPERS (Know Your Ancestry)
The greybeard advantage: understanding where every tool came from.
- The Google File System (2003) — Ancestor of GCS. Understand why append-only and chunk servers exist.
- MapReduce (2004) — Ancestor of Spark. "Embarrassingly parallel" computation on commodity hardware.
- Bigtable (2006) — Ancestor of HBase, Cassandra. Foundation of NoSQL column stores on GCP.
- Dynamo (2007, Amazon) — Eventual consistency, consistent hashing, vector clocks. CAP theorem made real.
- Spanner (2012) — Global distributed SQL with external consistency via TrueTime. The ancestor of Cloud Spanner's globally consistent transactions.
- Attention Is All You Need (2017) — The transformer paper. Read the math. A greybeard who understands matrix multiplication understands attention.
- ReAct (2023) — Synergizing Reasoning and Acting in LLMs. The foundation of every ADK agent you'll build.
- Lost in the Middle (2023) — Why LLMs ignore content in the middle of long context windows. Critical for RAG chunking strategy.
SIGNAL PODCASTS
- Latent Space — Best podcast for the AI engineer era. RAG, agents, evals.
- The Cognitive Revolution — Interviews with frontier model builders.
- Practical AI (Changelog) — Production AI, not hype. Engineering-focused.
- The Data Engineering Podcast — Modern data stack updates.
- Software Engineering Daily — Search: "GCP", "Palantir", "distributed systems".
- Hardcore History (Dan Carlin) — For the greybeard: understanding how large organizations actually change under pressure. More relevant to enterprise work than it sounds.
HIGH-SIGNAL NEWSLETTERS
- Import AI (Jack Clark) — AI progress + policy. The weekly digest of what actually matters.
- The Pragmatic Engineer (Gergely Orosz) — How big tech actually ships software. Essential for understanding client engineering cultures.
- Interconnects (Nathan Lambert) — Deep technical LLM training + alignment analysis.
- GCP Weekly — Every meaningful Google Cloud update.
- TLDR Data Engineering — Daily 5-minute data engineering digest.
- The Batch (Andrew Ng, DeepLearning.AI) — Weekly ML/AI practical perspective from a practitioner.
THE GREYBEARD'S SECRET CURRICULUM
Reading that no junior engineer has on their list but every 30-year vet should revisit:
THE CLASSICS (STILL APPLY)
- The Mythical Man-Month (Brooks, 1975) — Adding engineers to a late project makes it later. True in 1975. True today. Every FDE will see this happen.
- Structure and Interpretation of Computer Programs (SICP) — MIT's 1985 textbook that teaches you computation as a language, not a tool.
- The Art of Unix Programming (ESR) — Why small, composable programs beat monoliths. The philosophy behind microservices, before they had that name.
THE META-SKILLS
- How to Read a Paper (Keshav) — The three-pass method. 30 years of papers means you can read faster. Teach this to junior FDEs.
- Clear Thinking (Shane Parrish) — Mental models for decision-making under uncertainty. The consulting mindset in book form.
- Deep Work (Cal Newport) — How to protect focus time in client-site environments full of interruptions.